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Preface 


At the height of climate crisis, the UK strives to maintain its position at the forefront 
of the most rapidly decarbonising countries, harnessing efforts to end domestic coal 
power generation by 2025. The Net-zero initiative is the recent UK contribution to 
stop global warming. Technology has become ubiquitous and this has prompted a 
fundamental shift from large-scale centrally controlled energy market to distribution 
system operators (DSO) taking part in the single-flow energy market. As business 
and homes shift to less energy- and emissions-intensive activities, sustained by the 
emergence of affordable renewable energy, opportunities arise for new businesses 
and new market entries in the energy sector, which has hastened a lot of interest into 
the prediction of individual electric energy demand. With extreme weather events, 
inter-connectivity of modern society and information collection and speed with 
which it propagates, the sector faces mass digital disruption. There will be many 
challenges going forward, but also opportunities, for coming-together scientific 
disciplines to devise new solutions to old and new problems. 

In this book, that grew out of a co-supervision of a Master dissertation in the 
forecasting of individual electric demand, we present central concepts of extreme 
value theory, an area of statistics devoted to studying extreme events. We also list 
currently the most popular prediction algorithms for short-term forecasting that are 
normally dispersed across different research literature coming from mathematics, 
statistics and machine learning. Our main goal is to collect the different concepts 
needed for peak forecasting of individual electric demand, so they require minimal 
background knowledge and to present those concepts with a clear view of the 
assumptions required for their application and their benefits and limitations. 

The structure of the book The introductory chapter provides a description 
of the problem, namely, short-term prediction of electric demand on individual level, 
and motivation behind it. Our focus on peaks is also explained. The two data-sets 
that are used in Chap. 5 to illustrate the concepts presented in Chaps. 2-4 are 
described and basic exploratory analysis of two data-sets is presented. 

Chapter 2 starts with linear regression that is a basic ingredient of many different 
forecasting algorithms. Several methods from time-series data analysis are presented 
including hugely popular ARIMA models. Recently developed permutation-based 
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methods are included, based on their focus on peaks, and this is up to our knowledge 
for the first time that those methods have a place of their own in a review of popular 
methods. We hope that the time will show their usefulness. Support vector machines 
and artificial neural networks, with examples from both forward feed and recurrent 
networks, are representing machine learning based methods. 

Chapter 3 concerns the probabilistic theory underpinning extreme values of 
independent and identically distributed observations. In the way it is presented here, 
this theory relies strongly on the analytic theory of regular variation, following 
closely the work developed by Laurens de Haan. The content of this chapter will lay 
the foundations to the stochastic properties and corresponding statistical method- 
ology presented in Chap. 4. 

The methodology for inference on extreme values addressed in Chap. 4 has its 
focus narrowed down, as we go along, to the case of short tails with a finite upper 
bound to suit the specific application to the Irish smart meter data described in 
Chap. 1. This class of short-tailed distributions being tackled includes, but is not 
limited to Beta distributions and alike. We will be working on the max-domain of 
attraction rather than pretending that the limiting distribution provides an exact fit to 
the sampled data. This will enable a stretch to those distributions attaining finite 
boundary despite being attached to the Gumbel domain of attraction, thus endowed 
with more realistic characteristics than the typified exponential fit. Chapter 4 is 
drawn to a close with a brief literature review on recent theory for extremes of 
non-identically distributed random variables. 

Finally, in Chap. 5 short-term prediction with the focus on peaks is illustrated 
comparing methods described in Chap. 2 using a subset of publicly available data 
from Thames Valley Vision project. 

Chapter 1 was written by all three authors. Chapter 2 has Maria Jacob and DVG 
as authors, and Chap. 3 is authored by CN and Maria Jacob. Chapters 4 and 5 are 
authored by CN and DVG. 

The book is designed for any student or professional who wants to study these 
topics at a deeper level and assumes a wide range of different technical back- 
grounds. We hope that the book will be also useful for teaching. While we have 
attempted to balance mathematical rigour with accessibility to people with different 
technical backgrounds, the presented techniques are illustrated using the real-life 
data, and the corresponding code can be found on GitHub. 


Reading, UK Maria Jacob 
Reading, UK Cláudia Neves 
Milton Keynes, UK Danica Vukadinović Greetham 
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Chapter 1 A) 
Introduction E 


Electricity demand or load forecasts inform both industrial and governmental deci- 
sion making processes, from energy trading and electricity pricing to demand 
response and infrastructure maintenance. Electric load forecasts allow Distribution 
System Operators (DSOs) and policy makers to prepare for the short and long term 
future. For informed decisions to be made, particularly within industries that are 
highly regulated such as electricity trading, the factors influencing electricity demand 
need to be understood well. This is only becoming more urgent, as low carbon tech- 
nologies (LCT) become more prevalent and consumers start to generate electricity 
for themselves, trade with peers and interact with DSOs. 

In order to understand and meet demands effectively, smart grids are being devel- 
oped in many countries, including the UK, collecting high resolution data and making 
it more readily accessible. This data allows for better analysis of demand, identifica- 
tion of issues and control of electric networks. Indeed, using high quality forecasts 
is one of the most common ways to understand demand and with smart meter data, 
it is possible to know not only how much electricity is required at very high time 
resolutions, but also how much is required at the substation, feeder and/or household 
level. 

However, this poses new challenges. Mainly, load profiles of individual house- 
holds and substations are harder to predict than aggregated regional or national load 
profiles due to their volatile nature. The load profiles at the low voltage level contain 
irregular peaks, or local maxima, which are smoothed out when averaged across 
space or time. Once aggregated, the profiles are smoother, and easier to forecast. The 
work presented in this book outlines the challenge of forecasting energy demand at 
the individual level and aims to deepen our understanding of how better to forecast 
peaks which occur irregularly. 

Even more uncertainty arises from the drastic changes to the way society has used 
electricity thus far and will do so in the future. Many communities have moved away 
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from gas and coal powered technologies to electrically sourced ones, especially for 
domestic heating [1]. Moreover, where households and businesses were previously 
likely to be consumers, government policies incentivising solar energy have lead to 
an increase in photovoltaic panel installations [2], meaning that the interaction with 
the electricity grid/ DSO will become increasingly dynamic. 

In addition to this, governments are also diversifying the source of electricity 
generation, i.e. with renewable and non-renewable sources [3, 4] and incentivising 
the purchase of electric vehicles [5] in a bid to reduce national and global green- 
house emissions. This evolution of societal behaviour as well as governmental and 
corporate commitments to combat climate change is likely to add more volatility to 
consumption patterns [6] and thereby increase uncertainty. Most likely the changing 
climate itself will drive different human behaviours to current ones and introduce 
yet more unknowns to the problem. Therefore, while the literature on forecasting of 
electricity load is large and growing, there is a definite need to revisit the topic to 
address these issues. As demand response, battery control and peer-to-peer energy 
trading are all very sensitive to peaks at the individual or residential level, particular 
attention will be given to forecasting the peaks in low-voltage load profiles. 

While the change of attention from average load to peak load is not new, a novel 
approach in terms of electricity load forecasting, is to adapt the techniques from a 
branch of statistics known as Extreme Value Theory (EVT). We will speak in depth 
about it in later chapters but we briefly share a sense of its scope and our vision for its 
application to the electricity demand forecasting literature. We can use the methods 
from EVT to study the bad-case and the worst-case scenarios, such as blackouts 
which, though rare, are inevitable and highly disruptive. Not just households [7] but 
businesses [8] and even governments [9] may be vulnerable to risks from blackouts 
or power failure. In order to increase resilience and guard against such high impact 
events, businesses in particular may consider investing in generators or electricity 
storage devices. However, these technologies are currently expensive and the pur- 
chase of these may need to be justified through rigorous cost-benefit analyses. We 
believe that the techniques presented in this book and that to be developed throughout 
the course of this project could be used by energy consultants to assess such risks 
and to determine optimal electricity packages for businesses and individuals. 

As one of our primary goals is to study extremes in electricity load profiles and 
incorporate this into forecasts for better accuracy, we will first consider the forecasting 
algorithms that are commonly suggested in the literature and how and where these 
algorithms fail. The latter will be done by (1) considering different error measures 
(the classic approach in load forecasting) and (2) by studying “heteroscedasticity” 
in forecasts errors (an EVT approach), which for the moment can be understood 
as the irregular frequency of large errors or even the inability of the algorithm to 
predict accurately over time. We will also estimate the upper bound of the demand. 
We believe that DSOs will be able to use these kinds of techniques to realistically 
assess what contractual obligations to place upon individual customers and thereby 
tailor their contracts. They may also prove useful in demand response strategies. 

In this book, we will consider two smart meter data sets; the first is from smart 
meter trials in Ireland and the second is collected as part of the Thames Valley Vision 
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(TVV) Project in the UK. The Irish smart meter trials is available publicly and so 
has been used in many journal papers and is a good starting point. However, little 
information about the households is available. The TV V Project on the other hand is 
geographically compressed on a relatively small area, allowing weather and other data 
aboutthe area to be collected. The substation data is available at higher time resolution 
than the Irish smart meter data and subsequently provides more information with 
which to build statistical models. Combining both the classic forecasts with the 
results from EVT, we aim to set benchmarks and describe the extreme behaviour. 
While both case studies relate to energy, particularly electricity, the methods 
presented here are by no means exclusively for this sector; they can be and have 
been applied more broadly as we will see in later chapters. Thus, the work presented 
in this book may also serve to illustrate how results from EVT can be adapted to 
different disciplines. Furthermore, this book may also prove conducive to learning 
how to visualise and understand large amounts data and checking of underlying 
assumptions. In order to facilitate adaptations to other applications and generally 
share knowledge, some of the code used in this work has been made accessible 
through GitHub! so those teaching or attending data science courses may use it to 
create exercises extending the code, or to run experiments on different data-sets. 


1.4 Forecasting and Challenges 


Electricity load forecasts can be generated for minutes and hours in advance to years 
and decades in advance. Forecasts of different lengths assist in different applica- 
tions, for example forecasts for up to a day ahead are generated for the purpose of 
demand response or battery control, whereas daily to yearly forecasts may be pro- 
duced for energy trading, and yearly to decade forecasts allow for grid maintenance 
and investment planning and informing energy policy (Fig. 1.1). 

Most studies in electric load forecasting in the past century have focused on point 
load forecasting, meaning that at each time point, one value is provided, usually an 
average. The decision making process in the utility industry relies mostly on expected 
values (averages) so it is no surprise that these types of forecasts have been the 
dominant tool in the past. However, market competition and requirements to integrate 
renewable technology have inspired interest in probabilisitic load forecasts (PLF) 
particularly for system planning and operations. PLF may use quantiles, intervals 
and/or density functions [10]. We will review the forecast literature in more detail 
in Chap. 2, focusing mostly on point/deterministic forecasts. It is worth noting that 
many of those point-forecast methods can be implemented for quantiles prediction. 

It becomes evident from various electric load forecasting reviews presented by 
Gerwig [11], Alfares and Nazeeruddin [12], Hong and Fan [10], that many algorithms 
of varying complexity exist in the literature. However, for many reasons they are not 
always particularly good in predicting peaks [13]. The fundamental idea behind most 
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Second Hour Day Week Month Year Decade 


STLF LTLF 
EEEE 


Fig. 1.1 The various classifications for electric load forecasts and their applications. Based on: 
Hong and Fan [10]. The abbreviations are Short Term Load Forecasting (STLF), and Long Term 
Load Forecasting (LTLF) 


forecasting algorithms is that a future day (or time) is likely to be very much like days 
(or times) in the past that were similar to it with regard to weather, season, day of the 
week, etc. Thus, algorithms mostly use averaging or regression techniques to generate 
forecasts. This brings us back to the first challenge mentioned earlier: such algorithms 
work well when the demand profiles are smooth, for example due to aggregation at 
the regional and/or national level, but when the profiles are irregular and volatile, 
the accuracy of forecasts is reduced. This is usually the case for households or small 
feeder (sometimes called residential) profiles. In this way, it becomes obvious that 
we need algorithms that can recreate peaks in the forecasts that are representative of 
the peaks in the observed profiles. 

This brings us to the second challenge: in order to determine which algorithms 
perform well and which perform better (or worse), we need to establish benchmarks 
and specify how we measure accuracy. There are many ways of assessing the quality 
of forecasts, or more strictly many error metrics that may be used. Some conven- 
tional error metrics for load forecasts are mean absolute percentage error (MAPE) 
and mean absolute error (MAE) (see Sect. 2.2.1). These are reasonably simple and 
transparent and thus quite favourable in the electric load forecasting community. 
However, as noted by Haben et al. [14], for low-voltage networks, a peaky forecast is 
more desirable and realistic than a flat one but error metrics such as MAPE unjustly 
penalise peaky forecasts and can often quantify a flat forecast to be better. This is 
because the peaky forecast is penalised twice: once for missing the observed peak 
and again for forecasting it to be where it did not occur, even if only slightly shifted 
in time. Thus, some other error measures have been devised recently that tackle this 
issue. We will review these more in Chap. 2. 

Both of these challenges can also be approached from an EVT point of view. On 
the one hand, peaks in the data can be thought of as local extremes. By considering 
how large the observations can feasibly become in future, we may be able to quantify 
how likely it is that observations exceed some large threshold. Equally, as discussed 
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before, we can use heteroscedasticity to describe how behaviour deviates from the 
"typical" in time, which may help us to understand if particular time windows are 
hard to predict, thereby assessing uncertainty. 

Ultimately, we want to combine the knowledge from both these branches and 
improve electricity forecasts for each household. Of course, improving forecasts of 
individual households will improve forecasting ability overall, but DSOs are also 
interested in understanding how demand evolves in time and the limits of consump- 
tion. How much is a customer ever likely to use? When are peaks likely to happen? 
How long will they last? Knowing this at the household level can help DSOs to 
incentivise flexibility, load spreading or ‘peak shaving’. Such initiatives encourage 
customers to use less electricity when it is in high demand. Load spreading informed 
only by regional and national load patterns may prove counter productive at the sub- 
station level; for example, exclusive night time charging of electric vehicles, as this 
is when consumption is nationally low, without smart algorithms or natural diver- 
sity may make the substations or feeders vulnerable to night time surges, as pointed 
out in Hattam et al. [15]. Thus, understanding local behaviour is important to both 
informing policy and providing personalised customer services. 

Before we delve into the theory and methods, we familiarise ourselves with Irish 
smart meter data in Sect. 1.2.1 and with the TVV data in Sect. 1.2.2. 


1.2 Data 


1.2.1 Irish Smart Meter Data 


The first case study uses data obtained from Irish Social Science Data Archive [16]. 
The Smart Metering Project was launched in Ireland in 2007 with the intention 
of understanding consumer behaviour with regard to the influence of smart meter 
technology. To aid this investigation, smart meters were installed in roughly 5000 
households. Trials with different interventions were ran for groups of households. 
The data used in this book are from those households, which were used as controls in 
the trials. Therefore, they were not taking part in any intervention (above and beyond a 
smart meter installation). This gives complete measurements for 503 households. We 
have further subset the data to use only 7 weeks, starting in August 2010, where the 
weeks are labelled from 16 to 22 (inclusive). No bank holidays or other national 
holidays were observed in this period. Measurements were taken at half hourly 
resolution which are labelled from 1 to 48 where 1 is understood to correspond 
to midnight. Additionally days are also numbered from 593 (16th of August 2010) 
to 641. From this, the days of the weeks, ranging from 1 to 7 where 1 is Monday and 
7 is Sunday, were deduced. Regardless of the number of occupants, each household 
is considered to be the unit and the terminology of “customer” and “household” are 
used interchangeably and equivalently throughout. 
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Fig. 1.2 Histogram, logarithmic y scale, and box-plot of half hourly measurements in Irish smart 
meter data 


We now familiarise ourselves with the data at hand. Consider both the histogram 
and the box plot shown in Fig. 1.2. The 75th percentile for this data is 0.5 kWh 
meaning that three quarters of the observations are below this value, however some 
measurements are as high as 12 kWh. Generally, large load values can be attributed to 
consumers operating a small business from home, having electric heating, multiple 
large appliances and/or electric vehicles in their home. However, electric vehicle 
recharging does not seem to be a plausible explanation in this data set as it is a 
recurring, constant and prolonged activity and such a sustained demand was not 
observed in any of the profiles. Other large values are roughly between 9 and 10 
kWh so we may ask ourselves, what caused such a large surge? Was it a one time 
thing? How large can that value get within reason? How long can it last? We will 
address this specific question when we consider “endpoint estimation" in Chap. 4 
and for which the theoretical background will be reviewed in Chap. 3. 

While Fig. 1.2 tells us about half hourly demand, Fig. 1.3 gives some general 
profiles. These four plots show the total/cumulative pattern of electricity demand. 
The top left plotin Fig. 1.3 shows the dip in usage overnight, the increase for breakfast 
which stabilises during typical working hours with a peak around lunch and rises 
finally again for dinner, which is when it is at its highest on average. Similarly, the 
top right plot of Fig. 1.3 shows the total daily consumption for each day in the 7 week 
period. The plot highlights a recurring pattern which indicates that there are specific 
days in the week where usage is relatively high and others where it is low. This is 
further confirmed by the image on the bottom left which tells us that, in total, Fridays 
tend to have the lowest load, whereas weekends typically have the highest. Finally, 
the image on the bottom right shows a rise in demand starting in week 18, which is 
around the beginning of September, aligning with the start of the academic year for 
all primary and some secondary schools in Ireland. This explains why the jump in 
data occurs as the weeks preceding are weeks when many families may travel abroad 
and thus record less electricity demand in their homes. 

It is also valuable to see how the top left profile of Fig. 1.3 changes for each day 
of the week. From Fig. 1.4, it is obvious that there are some differences between 
weekdays and weekends; the breakfast peak is delayed on weekends but no categor- 
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Fig. 1.3 Cumulative demand profiles in kiloWatt hours (kWh) for various time horizons 
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Fig. 1.4 Total load profiles for each day of the week 


ical differences are obvious for the evening peaks between weekends and weekdays. 
Notice that both the top left image of Fig. 1.3 and the weekday profiles in Fig. 1.4 
show three peaks: one for breakfast around 8 am, another for lunch around 1 pm and 
the third in the evening which is sustained for longer. While we are not currently 
exploring the impact and benefits of clustering, we may use these three identifiers to 
cluster households by their usage in the future. 

Already, we can see the basis for the most forecasting algorithms that we men- 
tioned before. When profiles are averaged, they are smooth and thus overall averaging 
techniques may work well. Furthermore, if most Sundays record high usage, then it is 
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Fig. 1.5 Electric load day d against day d — 1 in kWh 


sensible to use profiles from past Sundays to predict the demand for future Sundays, 
i.e. to use similar days. 

In a similar way, it may be sensible to use similar time windows on corresponding 
days, that is using past Sunday evenings to predict future Sunday evenings. One 
way to see if this holds in practice as well as in principle is to consider correlation. 
Figure 1.5 shows the relationship between the daily demand of each household on day 
d against the daily demand on day d — 1. Each marker indicates a different household 
though it should be noted that there is not a unique colour for each. There seems to 
be evidence of a somewhat linear trend and some variation which may be resulting 
from the fact that weekends have not been segregated from weekdays and we are not 
always comparing similar days. To see how far back this relationship holds, an auto- 
correlation function (Fig. 1.6) is provided. The auto-correlation function is for the 
aggregated series given by the arithmetic mean of all customers, I Ya xi, where 
x; is the load of the ith household, at each half hour. The dashed line represents the 
95% confidence interval. As can be seen, there is some symmetry and while it is not 
shown here there is also periodicity throughout the data set though with decreasing 
auto-correlation. This gives us the empirical foundation to use many of the forecasts 
which rely on periodicity for accuracy. 

Finally, and as a prelude to what follows in Chap. 5, one way to see if there are 
"extreme" households is to consider the daily total demand of each household. This 
is shown in Fig. 1.7, again with each marker representing different households as 
before. Itis noteworthy thatthere is one house (coloured in light blue) that consistently 
appears to be using the most amount of electricity per day. This may be an example 
of a household where the occupants are operating a small business from home. 
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Fig. 1.6 Auto-correlation function for 1 day. Lag is measured in half hour 
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Fig. 1.7 Total daily demand for each household 


1.2.2 Thames Valley Vision Data 


This second case study uses data that was collected as a part of Scottish and Southern 
Electricity Network (SSEN) Thames Valley Vision project (TV V)? funded by the 


?7http://www.thamesvalleyvision.co.uk/our-project/. 
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Fig.1.8 Histogram, logarithmic y scale, and box-plot of half hourly measurements in TVV data 


UK gas and electricity regulator Ofgem through the Low Carbon Networks Fund and 
Network Innovation Competition. The project's overall aim was to monitor and model 
atypicallow voltage network using monitoring in households and substations in order 
to simulate future realistic demand scenarios. Bracknell, a moderate sized town west 
of London was chosen as it hosts many large companies and the local network, with 
its urban and rural parts, is representative of much of Britain's electricity network. 

This data set contains profiles for 226 households? on half-hourly resolution 
between 20th March 2014 and 22nd September 2015. The measurements for these 
households are timestamped and as was done for the Irish smart meter data, infor- 
mation of the day of the week, half hour of the week was deduced. We have also 
added a period of the week which marks each half hour in a week and ranges from 1, 
corresponding to 00:15 on Monday, to 336, corresponding to 23:45 on Sunday. We 
have also subset the data to include only full weeks. Thus, in this section, the analysis 
is presented for observations taken between 24th March 2014 and 20th September 
2015, spanning 546 days which is 78 weeks of data. 

We again start by considering the histogram and box plot of all measurements 
(Fig. 1.8). The largest value in this data set is 7.623 kWh, which is much smaller than 
our last case study, whereas the 75th percentile is 0.275 kWh. Though the magnitudes 
of these values are not the same, the general shape of the histogram here is similar 
to that of the Irish smart meter data; they are both left skewed and large values are 
relatively few. 

The box plot presented in Fig. 1.9 shows the consumption for each household. 

Next, we consider the general patterns and trends in the load. We do this by 
considering the average consumption. Let us start with the top left image of Fig. 1.10. 
Firstly, it shows that measurements were taken 15 min after and before the hour. The 
mean profile also appears to be less smooth as expected, than in the case of the 
Irish smart meter data, as they are less households. Still some fundamental and 
qualitative similarities persist; on average, electricity demand is low at night. This 
increases sharply after around 6 am and reaches its peak around 7.45 am. This surge 
in demand stabilises until a small peak during typical lunch time hours. Again, the 


3http://data.ukedc.rl.ac.uk/simplebrowse/edc/Electricity/NT V V/EPM. 
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Fig. 1.9 Box-plot of electricity load of each household in the TVV data 
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Fig. 1.10 Average demand profiles in (kWh) for various time horizons in the TVV data 


evening peak is still the period of highest demand; the peak reaches higher than 0.3 
kWh and is sustained for roughly 3 h. Note that if a household has electric vehicle, 
this will change the demand profile. However, as we discussed before, the presence 
of electric vehicles will change not just the timing of this high demand but also 
magnitude and duration. 

Of course, these values depend on the time of year and the day of the week as 
shown in the top right and bottom left plots of Fig. 1.10. The seasonal and annual 
cycle for daily average demand is obvious from the top right plot. Recall that day 1 
corresponds to the 20th of March 2014. Although it would be valuable to have an 
even longer time series, there are some periods for which two consecutive seasons of 
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data are present. This in general helps in forecasting, because it enables modelling 
seasonal and annual cycles. 

The weekly cycle shown in the bottom left plot of Fig. 1.10 is again in line 
with what we saw with the Irish smart meter data. On average, the weekends have 
high electricity consumption with lowest average demands being recorded between 
Wednesday and Friday. It may be possible that Mondays are relatively high because 
this plot does not differentiate between Mondays which are weekdays and Mondays 
which are bank holidays. We will consider this shortly. 

Finally, the bottom right plot in Fig. 1.10 reaffirms the seasonal cycle; winter 
months on average have higher electricity demand than do summer months. This is 
due to increased lighting, but it is also possible that there are at least some houses in 
the sample that heat their homes using electricity. Note while this seasonality may 
be important to model when forecasting aggregated load, it may be less important 
for forecasting individual load (see e.g. Haben et al. [17], Singh et al. [13], where 
gas heating is more prominent, like in most parts of the UK (see Department for 
Business, Energy & Industrial Strategy, UK [18]). 

While a day may be classified into the day of the week, we may also classify 
them by whether it is a holiday or working day and whether it succeeds a holiday 
or working day. Thus, we now consider how the top left plot of Fig. 1.10 changes 
depending on such a classification. 

The days and times were classified into 4 categories: working day followed by 
working day (ww), working day followed by a holiday (hw), holiday followed by a 
working day (wh) and holiday followed by a holiday (hh). All Sundays were classified 
as "hh" but weekdays can be classified as “hh” for example for Christmas or other 
bank holidays. Tuesdays to Fridays are mostly qualified as “ww” except when they 
occur immediate after Easter weekend, Christmas, boxing day, or new year's day, in 
which case they were classified as “hw”. As expected, Saturdays are mostly qualified 
as “wh” or as “hh” when they succeeded Fridays which were national holidays. The 
load profiles separated by these day classifications are shown in Fig. 1.11. 
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Fig. 1.11 Average demand profiles in (kWh) for each day classification by time of day for TVV 
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Again, we see qualitatively similar behaviour as the Irish smart meter data; the 
breakfast peaks occur earlier on working days and at similar times regardless of 
whether the previous day was a holiday or not. As was the case for the Irish smart 
meter data, the evening peaks are not distinguishably different between working 
days and holidays; the main difference is for day time consumption. In general, bank 
holidays and Sundays have the highest usage, Saturdays and other ordinary non- 
working days use slightly less but still significantly more than working days. The 
day time usage on working days is the lowest. 


1.3 Outline and Objectives 


As was mentioned before, the work that is presented in this book is the first part of a 
project, which aims to incorporate analyses of extremes into forecasting algorithms 
to improve the accuracy of forecasts for low-voltage networks, that is substations, 
feeders and households. Thus, it is an amalgamation of two research areas, which till 
now have remained relatively separate, in order to inform and affect decision making 
within the energy industry. 

Thus far, we have considered only generally the value of the current line of infer- 
ence to the utility industry. In what proceeds, we aim to give a thorough review of 
the literature and provide more specific reasons for why each method is used and 
discuss its shortcomings. In Chap. 2, we will explore in depth the literature of short 
term load forecasts (STLF). Within it, we will consider some industry standards, 
introduce some recent forecasting algorithms, and discuss forecast validation and 
uncertainty. After that, we will deviate for two chapters into the theory of extremes 
(Chap. 3) and the statistics of extremes (Chap. 4), both of which form the cornerstones 
of the work presented in the case studies in Chaps. 4 and 5. Presented forecasting 
and extremes techniques are illustrated in case studies. Benchmarks for end-point 
estimators of electric profiles and forecasting algorithms are established, some mod- 
ifications offered and crucially analyses of extremes is provided, which in return 
feeds into forecasts and their validation. 
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Chapter 2 A) 
Short Term Load Forecasting get 


Electrification of transport and heating, and the integration of low carbon technolo- 
gies (LCT) is driving the need to know when and how much electricity is being 
consumed and generated by consumers. It is also important to know what external 
factors influence individual electricity demand. 

Low voltage networks connect the end users through feeders and substations, 
and thus encompass diverse strata of society. Some feeders may be small with only 
a handful of households, while others may have over a hundred customers. Some 
low voltage networks include small-to-medium enterprises (SMES), or hospitals and 
schools, but others may be entirely residential. Furthermore, local feeders will also 
likely register usage from lighting in common areas of apartments or flats, street 
lighting and other street furniture such as traffic lights. 

Moreover, the way that different households on the same feeder or substation 
use electricity may be drastically different. For example, load profiles of residential 
households will vary significantly depending on the size of their houses, occupancy, 
socio-demographic characteristics and lifestyle. Profiles will also depend on whether 
households have solar panels, overnight storage heating (OSH) or electric vehicles 
[1]. Thus, knowing how and when people use electricity in their homes and com- 
munities is a fundamental part of understanding how to effectively generate and 
distribute electrical energy. 

In short term load forecasting, the aim is to estimate the load for the next half 
hour up to the next two weeks. For aggregated household demand, many different 
methods are proposed and tested (see e.g. Alfares and Nazeeruddin [2], Taylor and 
Espasa [3], Hong and Fan [4], etc.). Aggregating the data smooths it, therefore makes 
it easier to forecast. The individual level demand forecasting is more challenging and 
comes with higher errors, as shown in Singh et al. [5], Haben et al. [1]. The growth 
of literature on short term load forecasting at the individual level has started with the 
wider access to higher resolution data in the last two decades, and is still developing. 
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Fig. 2.1 Aggregations of different number of households 


So, why is it not enough to look at electricity from an aggregated point of view? 
Firstly, aggregated load profiles may not reflect individual load profiles, as can be 
seen from the example in Fig. 2.1. Here, the load has been aggregated for different 
number of households in one week and the subsequent smoothing is evident. Not 
only are aggregated load profiles smoother, they also tend to have stronger seasonality 
and weather dependency than disaggregated load profiles [6]. Demand side response, 
which encompasses efforts to modify consumption patterns, can be better informed 
by forecasts which can predict irregular peaks. This is especially true with distributed 
generation and when both demand and supply become dynamic with LCT integration. 

Secondly, as noted by Haben et al. [1], aggregations of individual load profiles 
do not consider network losses or other loads that are usually not monitored, such 
as traffic lights and street furniture. Having information of all load allows for better 
modelling and hence more efficient energy distribution and generation. 

Thirdly, considering aggregated load profiles tells us little about individual house- 
holds or businesses, who may benefit from having tailored energy pricing plans and 
contracts or need help for informed decision making regarding investment in batter- 
ies, photovoltaic and other LCT [7]. The enabling of these kinds of decision making 
processes is one of the central motivations of the research presented in this book. 
To do so, we want to consider forecasting methods from both statistics and ma- 
chine learning literature, specifically the state of the art forecasts within different 
categories, at the time of writing, and compare them. 

In the past, forecasting individual households and feeders was a challenge not just 
because new forecasting techniques were developing, but also because of the lack 
of access to a high quality data. The availability of smart meter data alleviates this 
hindrance and gives new opportunity to address this challenge. 


2 Short Term Load Forecasting 17 


Over the course of this chapter, we will consider several different forecasting 
algorithms stemming from various areas of mathematics. In Sect. 2.1, we consider the 
literature on different forecasts and discuss their strengths and weaknesses. Similarly, 
in Sect. 2.2, we will consider some popular ways of validating forecasts and discuss 
the merits, limitations and appropriateness of each. In the discussion in Sect. 2.3, we 
will motivate the choices of forecasts and error measures used for the case studies to 
be presented in Chap. 5. 


2.1 Forecasts 


Historically, forecasts have been generated to represent typical behaviour and thus 
have mostly relied on expectation values. Consequently, many popular algorithms in 
the literature, and in practice, are point load forecasts using averaging techniques [8] 
and indeed considering aggregations as mentioned above. Point load forecasts refer 
to forecasts which give a single, usually mean, value for the future load estimate. 
Regardless, the need to integrate LCT, market competition and electricity trading 
have brought about the need for probabilistic load forecasts which may include 
intervals, quantiles, or densities as noted by Hong and Fan [4] and Haben et al. [1]. 
In either point load forecasting or probabilistic load forecasting, many approaches 
exist and increasingly mixed approaches are being used to create hybrid profiles to 
better represent load with irregular peaks. 

The challenge that we are interested in addressing in this chapter is the following: 
given past load (measured in kWh), we want to create a week-long forecast with the 
same time resolution as the data, for one or more households. While electricity is 
consumed continuously, we work with time-stamped, discrete load measurements 
denoted by y; where t = (1,2, ..., N} denotes time and usually obtained from a 
smart meter. 

In this section, we will review several forecasting algorithms. We will illustrate the 
analyses presented in this chapter in a case study Sect. 5.1, using the TVV endpoint 
monitor data described in Sect. 1.2.2. 


2.1.1 Linear Regression 


Different techniques based on linear regression have been widely used for both short 
term and long term load forecasting. They are very popular due to the simplicity 
and good performance in general. Regression is used to estimate the relationship 
between different factors or predictors and the variable we want to predict. Linear 
regression assumes that these relationships are linear and tries to find the optimal 
parameters (or weights) so that the prediction error is minimal. This enables us to 
easily introduce different kind of variables such as calendar variables, past load and 
temperature. The basic model for multiple linear regression (MLR) is given by 
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Y =B xi +, (2.1) 


where y, is the dependent variable at time t which is influenced by the p independent 
variables x, = (l, xi1, X12, ..., Xip)” and B = (9, £i, ..., Bp)” are the correspond- 
ing regression parameters. The random error term, €;, is assumed to be normally 
distributed with zero mean and constant variance o? > 0, i.e. eV (0, o°). Also, 
E(e€,) = 0, fort z s. 

The dependent variable or series is the one we are interested in forecasting, 
whereas the x, contains information about the factors influencing the load such 
as temperature or a special day. 

As noted in the tutorial review by Hong and Fan [4], the regressions coefficients 
or parameters are usually estimated using ordinary least squares using the following 
formula: 


n -l n 
B = > xx? 2»: x+y; (2.2) 
t=1 t=1 


^ 


The least squares estimator for 8 is unbiased, i.e., E[G] = 3. We also make the con- 
nection that the least squares estimator for G is the same as the maximum likelihood 
estimator for 6 if the errors, €,, are assumed to be normally distributed. 

This simple linear model is then basis for various forecasting methods. We start 
by listing several examples for aggregated load forecast. For example, Moghram 
and Rahman [9] explored MLR, amongst others, to obtain 24h ahead hourly load 
forecast for a US utility whilst considering dry bulb! and dew point temperature? as 
well as wind speed. Two models were calibrated, one for winter and one for summer. 
The authors divided the day into unequal time zones which corresponded roughly to 
overnight, breakfast, before lunch, after lunch, evening and night time. It was found 
that dividing the day in this way resulted in a better fit than not dividing the day at all 
or dividing it equally. The authors also found significant correlations for temperature 
and wind speed when using the MLR model. 

Charlton and Singleton [10] used MLR to create hourly load forecasts. The re- 
gression model considered temperature (up to the power of two), day number, and 
the multiplication of the two. The created model accounts for the short term effects 
of temperature on energy use, long term trends in energy use and the interaction 
between the two. Further refinements were introduced by incorporating smoothed 
temperature from different weather stations, removing outliers and treating national 
holidays such as Christmas as being different to regular holidays. Each addition 
resulted in reduction in errors. 


! Dry bulb temperature is the temperature as measured when the thermometer is exposed to air but 
not to sunlight or moisture. It is associated with air temperature that is most often reported. 

?Dew point temperature is the temperature that would be measured if relative humidity is 100% 
and all other variables are unchanged. Dew point temperature is always lower than the dry bulb 
temperature. 
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In a similar vein to the papers above, Alfares and Nazeeruddin [2] consider nine 
different forecasting algorithms in a bid to update the review on forecasting methods 
and noted the MLR approach to be one of the earliest. The aim was to forecast the 
power load at the Nova Scotia Power Corporation and thus pertained to aggregated 
load. They found the machine learning algorithms to be better overall. The vast 
literature surveyed in Alfares and Nazeeruddin [2], Hong and Fan [4] and many other 
reviews, show linear regression to be popular and reasonably competitive despite its 
simplicity. 

While the most common use of regression is to estimate the mean value of the 
dependent variable, when the independent variables are fixed, it can be also used to 
estimate quantiles [1, 11]. 

The simple seasonal quantile regression model used in Haben and Giasemidis 
[11] was updated in Haben et al. [1] and applied to hourly load of feeders. Treating 
each half-hour and week day as separate time-series, the median quantile is estimated 
using the day of trial, three seasons (with sin and cos to model periodicity), a linear 
trend and then temperature is added using a cubic polynomial. 

To find optimal coefficients for linear regression, one usually relies on ordinary 
least squares estimator. Depending on the structure of a problem, this can result in 
an ill-posed problem. Ridge regression is a commonly used method of regularisation 
of ill-posed problems in statistics. Suppose we wish to find an x such that Ax — 
b, where A is a matrix and x and b are vectors. Then, the ordinary least squares 
estimation solution would be obtained by a minimisation of || Ax — b||;. However 
for an ill-posed problem, this solution may be over-fitted or under-fitted. To give 
preference to a solution with desirable properties, the regularisation term ||Ux ||; is 
added so that the minimisation is of || Ax — b||2 + ||['x||2. This gives the solution 


$ — (ATA 4 T?T) | ATH 


2.1.2 Time Series Based Algorithms 


The key assumptions in classical MLR techniques is that the dependent variable, 
yr, is influenced by independent predictor variables x, and that the error terms are 
independent, normally distributed with mean zero and constant variance. However, 
these assumptions, particularly of independence, may not hold, especially when 
measurements of the same variable are made in time, say owing to periodic cycles 
in the natural world such as seasons or in our society such as weekly employment 
cycles or annual fiscal cycle. As such, ordinary least squares regression may not be 
appropriate to forecast time series. Since individual smart meter data may be treated 
as time series, we may borrow from the vast body of work that statistical models 
provide, which allow us to exploit some of the internal structure in the data. In 
this section, we will review the following time series methods: autoregressive (AR) 


?In the Bayesian interpretation, simplistically this regularised solution is the most probable solution 
given the data and the prior distribution for x according to Bayes' Theorem. 
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models (and their extensions), exponential smoothing models and kernel density 
estimation (KDE) algorithms. 


2.1.2.1 Autoregressive Models 


Time series that stem from human behaviour usually have some temporal depen- 
dence based on our circadian rhythm. If past observations are very good indicators 
of future observations, the dependencies may render linear regressions techniques 
an inappropriate forecasting tool. In such cases, we may create forecasts based on 
autoregressive (AR) models. In an AR model of order p, denoted by AR(p), the 
load at time ¢, is a sum of a linear combination of the load at p previous times and a 
stochastic error term: 


P 
y=at > iyii te, (2.3) 


i=1 


with a is a constant, à; are AR parameters to be estimated, p is the number of 
historical measurements used in the estimation and e denotes the error term which 
is typically assumed to be independent with mean 0 and constant variance, a”. In a 
Way, we can see some similarity between the MLR model and the AR model; in the 
MLR, load is dependent on external variables but in the AR model, load is a linear 
combination of previous values of load. 

An example of using an AR model to estimate feeders' load is given in Haben 
et al. [1]. Here, the model is applied to residuals of load, 7, = y; — ji, where p is an 
expected value of weekly load. The most obvious advantage of using the residuals is 
that we can define r, in a such way that it can be assumed to be stationary. In addition, 
Hı models typical weekly behaviour and thus changing the definition of ju, allows the 
modeller to introduce seasonality or trends quite naturally and in various different 
ways, as opposed to the using load itself. In Haben et al. [1], the AR parameters were 
found using the Burg method.^ Seasonality can be introduced by including it in the 
mean profile, u. 

Other examples of AR models and their modifications include Moghram and 
Rahman [9], Alfares and Nazeeruddin [2], Weron [12] and Taylor and McSharry 
[13], but most of these are studies with aggregated load profiles. 

Since we expect that past load is quite informative in understanding future load, 
we expect that AR models will be quite competitive forecasts, especially when built 
to include trends and seasonality. 


^The Burg method minimises least square errors in Eq. (2.3) and similar equation which replaces 
r, .j With ripi. 
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2.1.2.2 Seasonal Autoregressive Integrated Moving 
Average—SARIMA Models 


From their first appearance in the seminal Box & Jenkins book in 1970 (for the 
most recent edition see Box et al. [14]), autoregressive integrated moving average 
(ARIMA) time series models are widely used for analysis and forecasting in a wide 
range of applications. The time series y, typically consists of trend, seasonal and 
irregular components. Instead of modelling each of the components separately, trend 
and seasonal are removed by differencing the data. The resulting time series is then 
treated as stationary (i.e. means, variances and other basic statistics remain unchanged 
over time). As we have seen in the previous section, AR models assume that the 
predicted value is a linear combination of most recent previous values plus a random 
noise term. Thus, 


p 
y=at o» + &, 


i=1 


where a is a constant, ¢ are weights, p is a number of historical values considered 
and e, ~ N (0, o?). The moving average model (MA) assumes the predicted value 
to be the linear combination of the previous errors plus the expected value and a 
random noise term, giving 


q 
y= u+ X biei + €, 


i=1 


where jz is the expected value, 0 are weights, q is the number of historical values 
considered, and e, ~ N (0, o?). The main parameters of the model are p, d and q, 
where p is the number of previous values used in the auto-regressive part, d is the 
number of times we need to difference the data in order to be able to assume that is 
stationary, and q is the number of previous values used in the moving average part. 
When strong seasonality is observed in the data, a seasonal part, modelling repetitive 
seasonal behaviour, can be added to this model in a similar fashion, containing its 
own set of parameters P, D, Q. A SARMA model, seasonal autoregressive moving 
average model for 24 aggregated energy profiles is explored in Singh et al. [5] based 
on 6s resolution data over a period of one year. Routine energy use is modelled with 
AR part and stochastic activities with MA part. A daily periodic pattern is captured 
within a seasonal model. The optimal parameters were determined as p — 5 and 
q — 30. Theleasterror square minimisation was used, where the results with different 
parameter values were compared and the ones that minimised the error were picked 
up. Interestingly, SARMA not only outperformed other methods (support vector, 
least square support vector regression and artificial neural network with one hidden 
layer of ten nodes) regarding mean load prediction, but also regarding peak load 
prediction, resulting in smaller errors for peaks. 

(S)ARMA and (S)ARIMA models can be extended using exogenous variables 
such as temperature, wind chill, special day and similar inputs. These are called 
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(S)ARIMAX or (S)ARMAX models, for example Singh et al. [5] gives the following 
ARMAX model 


p d r 
» 2a Gyi te+u+) biet) GTi, 
i=l i=l 


i=1 


where (3s are further parameters that represent the exogenous variables 7;, for instance 
the outdoor temperature at time t. Two simple algorithms Last Week (LW) and 
Similar Day (SD) can be seen as trivial (degenerate) examples of AR models with 
no error terms, and we will used them as benchmarks, when comparing different 
forecasting algorithms in Chap. 5. The Last Week (LW) forecast is a very simple 
forecast using the last week same half-hour load to predict the current one. Therefore, 
it can be seen as an AR model where p = 1, d = 0, a = 0, ġı = 1, & = 0. 

The Similar Day (SD) forecast instead uses the average of n last weeks, same 
half-hour loads to predict the current one. Therefore, it can be seen as an AR model 
where p=n,d=0,a=0, d,..-¢, = L6 = 0. 


2.1.2.3 Exponential Smoothing Models 


The simplest exponential smoothing model puts exponentially decreasing weights 
on past observations. 

Suppose we have observations of the load starting from time t = 1, then the 
single/simple exponential smoothing model is given by 


S, = ay, + (1— a)S;-1, (2.4) 


where a € (0, 1), S; is the output of the model at ¢ and the estimate for the load 
at time ¢ + 1. Since the future estimates of the load depend on past observations 
and estimates, it is necessary to specify Sı. One choice for Sı = yı, but this puts 
potentially unreasonable weight on early forecasts. One may set Sı to be the mean 
of the first few values instead, to circumvent this issue. Regardless, the smaller the 
value of o, the more sensitive the forecast is to the initialisation. 

In the single exponentially smoothed model, as o tends to zero, the forecast tends 
to be no better than the initial value. On the other hand, as o tends to 1, the forecast 
is no better than the most recent observation. For o = 1, it becomes the LW forecast 
given in the previous section. The choice for œ may be made by the forecaster, 
say from previous experience and expertise or it may chosen by minimising error 
functions such as a mean square error. 

When the data contains a trend, a double exponential smoothing model is more 
suitable. This is done by having two exponential smoothing equations: the first on 
the overall data (2.5) and the second on the trend (2.6): 
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S, — ay, + (l — o)(Si-1 + bii), (2.5) 
b, = BS; — S) + 0 — B)bis, (2.6) 


where b; is the smoothed estimate of the trend and all else remains the same. We now 
have a second smoothing parameter that must also be estimated. Given the model 
in (2.5) and (2.6), the forecast for load at time t + m is given by y; = S; + mb,. 

Of course, we know that electricity load profiles have daily, weekly and annual 
cycles. Taylor [15] considered the triple exponential smoothing model, also known as 
the Holt-Winters exponential smoothing model, to address the situation where there 
is not only trend, but intraweek and intraday seasonality. The annual cycle is ignored 
as it is not likely to be of importance for forecasts of up to a day ahead. Taylor [15] 
further improved this algorithm by adding an AR(1) model to account for correlated 
errors. This was found to improve forecast as when the triple exponential model 
with two multiplicative seasonality was used, the one step ahead errors still had large 
auto-correlations suggesting that the forecasts were not optimal. To compensate, the 
AR term was added to the model. 

Arora and Taylor [16] and Haben et al. [1] used a similar model, though without 
trend, to forecast short term load forecast of individual feeders with additive intraday 
and intraweek seasonality. Haben et al. [1] found that the so called Holt-Winters- 
Taylor (HWT) triple exponential smoothing method that was first presented in Taylor 
[17] was one of their best performing algorithms regardless of whether the temper- 
ature was included or omitted. 

A model is given by the following set of equations: 


yr = S51 + dics, Wiss, + $e; + €, 

€; = y, — (Sici + dios, + Wiss), 

S, = S;-1 + Aer, (2.7) 
d, = di-s, + ôer, 


Wi = Wis, + Wêr, 


where y, is the load, S, is the exponential smoothed variable often referred to as 
the level, w, is the weekly seasonal index, d, is the daily seasonal index, s2 = 168, 
5; = 24 (as there are 336 half-hours in a week and 48 in a day), e, is the one step 
ahead forecast error. The parameters A, ô and w are the smoothing parameters. This 
model has no trend, but it has intraweek and intraday seasonality. The above mention 
literature suggests that when an exponential model is applied, the one-step ahead 
errors have strong correlations that can be better modelled with an AR(1) model, 
which in (2.7) is done through the $ term. The k-step ahead forecast is then given 
by S, + Wes 4k + ote, from the forecast starting point f. 
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2.1.24 Kernel Density Estimation Methods 


Next, we briefly consider the kernel density estimation (KDE), that is quite popular 
technique in time-series predictions, and has been used for load prediction frequently. 
The major advantage of KDE based forecasts is that they allow the estimation of the 
entire probability distribution. Thus, coming up with probabilistic load forecasts is 
straight forward and results are easy to interpret. Moreover, a point load forecast can 
be easily constructed, for example by taking the median. This flexibility and ease 
of interpretation make kernel density forecasts useful for decision making regarding 
energy trading and distribution or even demand side response. However, calculating 
entire probability density functions and tuning parameters can be computationally 
expensive as we will discuss shortly. 

We divide the KDE methods into two broad categories, conditional and uncon- 
ditional. In the first instance, the unconditional density is estimated using historical 
observations of the variable to be forecasted. In the second case, the density is con- 
ditioned on one or more external variables such as time of day or temperature. The 
simplest way to estimate the unconditional density using KDE is given in (2.8): 


FO 9 Y Ku, 6i - D. (2.8) 


¡=l 


where {y1, ..., Li} denotes historical load observations, K} () = K (-/ h)/ h denotes 
the kernel function, hz > 0 is the bandwidth and f (y) is the local density estimate at 
point y which takes any value that the load can take. If instead, we want to estimate 
the conditional density, then: 


2 Kj. (Xi — x) Kn, Oi — D) 


fap) == (2.9) 
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where hz > 0 and ^, > 0 are bandwidths. KDE methods have been used for energy 
forecasting particularly in wind power forecasting but more recently Arora and Taylor 
[18] and Haben et al. [1] used both conditional and unconditional KDE methods to 
forecast load of individual low voltage load profiles. Arora and Taylor [18] found 
one of the best forecasts to be using KDE with intraday cycles as well as a smoothing 
parameter. However, Haben et al. [1] chose to exclude the smoothing parameter as 
its inclusion costs significant computational efforts. In general, the conditional KDE 
methods have higher computational cost. This is because the optimisation of the 
bandwidth is a nonlinear which is computationally expensive and the more variables 
on which the density is estimated, the more bandwidths must be estimated. 

In the above discussion, we have omitted some details and challenges. Firstly, 
how are bandwidths estimated? One common method is to minimise the difference 
between the one step ahead forecast and the corresponding load. Secondly, both 
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Arora and Taylor [18] and Haben et al. [1] normalise load to be between 0 and 1. 
This has the advantage that forecast accuracy can be more easily discussed across 
different feeders and this also accelerates the optimisation problem. However, (2.8) 
applies when / can take any value and adjustments when the support of the density is 
finite. Arora and Taylor [18] adjust the bandwidth near the boundary whereas Haben 
et al. [1] do not explicitly discuss the correction undertaken. The choice of kernels 
in the estimation may also have some impact. The Gaussian kernel? was used in both 
of the papers discussed above but others may be used, for example Epanechnikov? 
or biweight kernels. 


2.1.3 Permutation Based Algorithms 


Though the methods discussed in the above section are widely used forecasting tools, 
their performances on different individual smart meter data-sets vary. Some of the 
mentioned algorithms have smoothing properties and thus, they may be unsuitable 
when focusing on individual peak prediction. We now list several permutation-based 
algorithms that are all based on the idea that people do same things repeatedly, but 
in slightly different time periods. This is of relevance for modelling demand peaks. 


2.1.3.1 Adjusted Average Forecast 


One of the simple forecasts we mentioned before at the end of Sect. 2.1.2.1, Similar 
day (SD) forecast averages over the several previous values of load. For example, 
to predict a load on Thursday 6.30 pm, it will use the mean of several previous 
Thursdays 6.30 pm loads. But what happens if one of those Thursdays, a particular 
household is a bit early (or late) with their dinner? Their peak will move half an hour 
or hour earlier (or later). Averaging over all values will smooth the real peak, and the 
mistake will be penalised twice, once for predicting the peak and once for missing 
earlier (later) one. Haben et al. [19] introduced a new forecasting algorithm which 
iteratively updates a base forecast based on average of previous values (as the SD 
forecasting), but allows permutations within a specified time frame. We shall refer 
to it as the Adjusted Average (AA) forecast. The algorithm is given as follows: 


(i) For each day of the week, suppose daily profiles G are available for past N 
weeks, where k — 1,..., N. By convention, G is the most recent week. 

(ii) A base profile, F“”, is created whose components are defined by the median of 
corresponding past load. 


6K (u) = 3(1 — u?) for |u| < 1 and K (u) = 0 otherwise. 
7K(u) = Ba —u?)? for |u| < 1 and K (u) = 0 otherwise. 
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(iii) This baseline is updated iteratively in the following way. Suppose, at itera- 
tion k, we have F for 1 <k<N-—1, then F“* is obtained by setting 


a (k) AU A A , 
Fe) — BH (6 + kF®), where G = PG™ with P € P being a per- 
mutation matrix s.t. ||PG — F® ||, = min || PG — FL. P is the set of 
Pe 


restricted permutations i.e, for a chosen time window, w, the load at half hour i 
can be associated to the load at half hour j if |i — j| < w. 


N 
(iv) The final forecast is then given by F? = Xu (x â” F p) 
k=1 

In this way, the algorithm can permute values in some of historical profiles in order to 
find the smallest error between observed and predicted time series. This displacement 
in time can be reduced to an optimisation problem in bipartite graphs, the minimum 
weight perfect matching in bipartite graphs, [20], that can be solved in polynomial 
time. 

A graph G = (V, E) is bipartite if its vertices can be split into two classes, so that 
all edges are in between different classes. Two bipartite classes are given by obser- 
vations y; and forecasts f;, respectively. Errors between observations and forecasts 
are used as weights on the edges between the two classes. Instead of focusing only at 
errors e; = y; — f, (i.e. solely considering the edges between y, and f;), differences 
between 


yi = fii Yi — fies Yi = fi-3 Yt — fien s Yt = ficos Yt — Situs 


are also taken into account, for some plausible time-window w. It seems reasonable 
not to allow, for instance, to swap morning and evening peaks, so w should be kept 
small. 

These differences are added as weights and some very large number is assigned 
as the weight for all the other possible edges between two classes, in order to stop 
permutations of points far away in time. Now, the perfect matching that minimises 
the sum of all weights, therefore allowing possibility of slightly early or late fore- 
casted peaks to be matched to the observations without the double penalty is found. 
The minimum weighted perfect matching is solvable in polynomial time using the 
Hungarian algorithm Munkres [21], with a time complexity of O(n(m +n logn)) 
for graphs with n nodes (usually equal to 2 x 48 for half-hourly daily time series 
and m edges (~ 2 x n x w). It is important to notice that although each half-hour 
is considered separately for prediction, the whole daily time series is taken into ac- 
count, as permutations will affect adjacent half-hours, so they need to be treated 
simultaneously. 


2.1.3.2. Permutation Merge 


Based on a similar idea, Permutation Merge (PM) algorithm presented in Charlton 
et al. [22] uses a faster optimisation—the minimisation of p-adjusted error (see 
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Fig. 2.2 An example of permutation merge 


Sect. 2.2.2.2) to match peaks in several profiles simultaneously, based on finding a 
shortest path in a directed acyclic graph (a graph with directed edges and no cycles). 
Either Dijkstra algorithm or a topological sort can be used for that Schrijver [20]. 

Given the n previous profiles, the algorithm builds a directed acyclic graph be- 
tween each point in time and its predecessors and successors inside a time-window 
w, allowing for permutations of values in that window. The cost of each permutation 
is the difference of the two values that is caused by permutation. Then the minimum 
weighted path gives an ‘averaged’ profile with preserved peaks. 

As the algorithms complexity is O(nw’ 4"), where n is the number of historic 
profiles, N is the length of time series and w is time window where permutations 
are allowed, only small ws are computationally feasible. If we have two profiles of 
length five, x = [0, 0, 3, 0, 0] and y = [3, 0, 0, 0, 0] and w = 1, so we can permute 
only adjacent values, the constructed graph and the minimum length (weighted) path 
is given below on Fig. 2.2. As we have two profiles, and are trying to find two 
permutations that will give us the minimum difference with the median of those two 
profiles, in each times-step we have 4 possibilities: (0, 0) means both profiles stay 
the same, (0, 1) the first stays the same, the second is permuted, (1, 0) the first is 
permuted, the second stays the same, and (1, 1) means both are permuted. As we 
have to have perfect matching we have n + 1 = 6 layers in the graph, and some paths 
are not available. The solution gives us [0, 3, 0, 0, 0] for both profiles. 


2.1.3.3 Adjusted k-nearest Neighbours and Extensions 


Valgaev et al. [23] combined the p-adjusted error from Sect. 2.2.2.2 and PM using the 
k-nearest neighbour (KNN) regression algorithm. The standard kNN algorithm starts 
by looking for a similar profile in historic data. This is usually done by computing 
Euclidean based distance between the profiles and returning k minimum distance 
ones. Then the arithmetic mean is computed and returned as a prediction. Here, 
instead of the Euclidean distance, the p-adjusted error is used, and instead of com- 
puting an arithmetic mean, permutation merge is used to compute adjusted mean. 
This approach is extended to Adjusted feature aware k-nearest neighbour (AFKNN) 
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in Voß et al. [24] using external factors (temperature, bank holiday, day of week), 
with one difference. Instead of using adjusted error, the Gower distance 


N CP) Cf) 
Ix =x 


1 
Del, j) = N » max x) — min x)’ 
is deployed. This is computationally demanding, but can result in better performance 
than PM in average as it has been shown in Vof et al. [24]. 

The advantage of permutation based algorithms, as mentioned above, is that these 
iterative permutations allow forecasted load profiles to look more like the observed 
load profiles. They are better able to replicate the irregular spikes than the more 
common averaging or regression based algorithms. However, some error measures 
such as those that will be discussed in Sect. 2.2.1, can doubly penalise peaky forecasts. 
Both Charlton et al. [22] and Haben et al. [19] demonstrate how a flat “average” 
forecast is only penalised once for missing the observed peak whereas if a peak is 
forecasted slightly shifted from when it actually it occurs, it will be penalised once 
for missing the peak and again for forecasting it where it was not observed. 


2.1.4 Machine Learning Based Algorithms 


Machine learning algorithms such as artificial neural networks and support vector 
machines have been remarkably successful when it comes to understanding power 
systems, particularly for high voltage systems [25, 26] or aggregated load [2, 9, 27]. 
The big advantage of machine learning techniques is that they can be quite flexible 
and are capable of handling complexity and non-linearity [12, 28]. 

However, the parameters such as weights and biases in a machine learning frame- 
work do not always have similarly accessible physical interpretations as in the statis- 
tical models discussed, Moreover, some machine learning algorithms such as those 
used for clustering do not include notions of confidence intervals [29]. Nonetheless, 
since they have such large scope within and outside of electricity forecasting and 
since we are mostly interested in point load forecasting in this book, we review two 
key methods within artificial neural networks, multi-layer perceptron and long short 
term memory network, and discuss support vector machines. 


2.1.4.4 Artificial Neural Networks 


Artificial Neural Networks (ANN) are designed to mimic the way the human mind 
processes information; they are composed of neurons or nodes which send and receive 
input through connections or edges. From the input node(s) to the output node(s), a 
neural network may have one or more hidden layers. The learning may be shallow 
i.e. the network has one or two hidden layers which allows for faster computation. 
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Or it may be deep, meaning it has many hidden layers. This then allows for more 
accurate forecasts, but at the cost of time and complexity. When there is a need to 
forecast many customers individually, computational and time efficiency is a practical 
requirement for everyday forecasting algorithms. Thus, shallow neural networks such 
as multi-layer perceptron (MLP) with one hidden layer tended to be used frequently 
[30-32]. 


2.1.4.2 Multi-layer Perceptron 


MLP is an example of a feedforward ANN. This means that the network is acyclic, 
i.e. connections between nodes in the network do not form a cycle. MLP consist of 
three or more layers of nodes: the input layer, at least one hidden layer, and an output 
layer. 

Nodes have an activation function which defines the output(s) of that node 
given the input(s). In MLP, activation functions tend to be non-linear with com- 
mon choices being the rectifier (f(x) = xt = max(0, x)), the hyperbolic tangent 
(f (x) = tanh(x)), or the sigmoid function (f (x) = 1/(1 + e *)) and the neural net- 
work is trained using a method known as backpropagation. 

Briefly, it means that the gradient of the error function is calculated first at the 
output layer and then distributed back through the network taking into accounts 
the weights and biases associated with the edges and connections in the network. 
Gajowniczek and Zabkowski [32] and Zufferey et al. [31] are two recent examples 
of the use of MLP to individual smart meter data both with one hidden layer. 

Gajowniczek and Zabkowski [32] had 49 perceptrons in the input layer, 38 per- 
ceptrons in the hidden layer and the 24 perceptrons in the output layer to coincide 
with hourly load forecasts. However, Gajowniczek and Zabkowski [32] was tried on 
load profile on one household where many details such as occupant number, list of 
appliances were known. Of course, such information is not freely available. 

Zufferey et al. [31], on the other hand, tested a MLP forecast on a larger number 
of households and found that 200 perceptrons in the hidden layer was a reasonable 
trade-off between accurate predictions and reasonable computation time. They also 
found that the inclusion of temperature had limited influence on forecast accuracy, 
which is similar to the findings of Haben et al. [1] using time-series methods. 


2.1.4.3 Long-Short-Term-Memory 


Even more recently, Kong et al. [33] used a type of recurrent neural network (RNN) 
known as the long short-term memory (LSTM) RNN. 

These types of models have been successfully used in language translation and 
image captioning due to the their architecture; since this type of RNN have links 
pointing back (so they may contain directed cycles, unlike the neural networks dis- 
cussed before), the decision they make at a past time step can have an impact on the 
decision made at a later time step. 
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In this way, they are able to pick up temporal correlation and learn trends that 
are associated with human behaviour better than traditional feed-forward neural 
networks. When compared to some naive forecasting algorithms such as the similar 
day forecast as well as some machine learning algorithms, Kong et al. [33] found 
that LSTM network was the best predictor for individual load profiles, although with 
relatively high errors (MAPE, see Sect.2.2.1, was still about 4446 in the best case 
scenario for individual houses). 

The LSTM network that has been implemented in Kong et al. [33] has four inputs: 
(1) the energy demand from the K past time steps, (ii) time of day for each of the 
past K energy demand which is one of 48 to reflect half hours in a day, (iii) day of 
the week which is one of 7, (iv) a binary vector that is K long indicating whether 
the day is a holiday or not. Each of these are normalised. The energy is normalised 
using the min-max normalisation.? 

The normalisation of the last three inputs is done using one hot encoder.’ The 
LSTM network is designed with two hidden layers and 20 nodes in each hidden layer. 
The MAPE (see Sect. 2.2) is lowest for individual houses when LSTM network is used 
when K — 12 though admittedly the improvement is small and bigger improvements 
came from changing forecasting algorithms. 


2.1.4.4 Support Vector Machines 


Support Vector Machines (SVM) are another commonly used tool in machine learn- 
ing, though usually associated with classification. As explained in Dunant and 
Zufferey [28], SVM classify input data by finding the virtual boundary that sep- 
arates different clusters using characteristics which can be thought of the features. 
This virtual boundary is known as the hyper-plane. Of course, there may exist more 
than one hyper-plane, which may separate the data points. Thus the task of an SVM 
is to find the hyper-plane such that the distance to the closest point is at a maximum. 

We may then use the SVM for regression (SVR) to forecast load as it has been 
done in Humeau et al. [27] and Vrablecová et al. [34]. In this case, instead of finding 
the function/hyper-plane that separates the data, the task of the SVR is to find the 
function that best approximates the actual observations with an error tolerance e. For 
non-linear solutions, the SVR maps the input data into a higher dimensional feature 
space. 

Humeau et al. [27] used both MLP and SVR to forecast load of single household 
and aggregate household using data from the Irish smart meter trials. The goal was 
to create an hour ahead forecast and 24 h ahead forecast. The features used included 


8Suppose observations are denoted by x. These can be normalised with respect to their minimum 
x—min(x) 
max(x)—min(x) * 


°The one hot encoder maps each unique element in the original vector that is K long to a new vector 
thatis also K long. The new vector has values 1 where the old vector contained the respective unique 
element and zeros otherwise. This is done for each unique element in the original vector. 


and maximum values as follows: z — 
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hour of the day and day of the week in calendar variables and the load from the past 
three hours as well as temperature records in Dublin. 

In order to deal with variability and to give an indication of the evolution of 
load, the authors also add load differences, i.e. L; — L;. ; and L; — 2Li-1 + Lj. 5. 
Humeau et al. [27] noticed that for individual households, both for the hour ahead 
and the 24h ahead, the linear regression outperforms the SVR which the authors 
do not find surprising. The effectiveness lies in the aggregated load cases where the 
internal structure, which the SVR is designed to exploit, is clearer. 

More recently, Vrablecová et al. [34] also used SVR method to forecast load from 
the Irish smart meter data. Many kernel functions, which map input data into higher 
dimensions, were tried. The best results were found using radial basis function kernels 
and it was noted that sigmoid, logistic and other nonparametric models had very poor 
results. For individual households, SVR was not found to be the best methods. 

Thus, from the reviewed literature we conclude that, while SVR is a promis- 
ing algorithm of forecasting aggregated load, the volatility in the data reduces its 
effectiveness when it comes to forecasting individual smart meter data. 


2.2 Forecast Errors 


As more and more forecasting algorithms become available, assessing how close the 
forecast is to the truth becomes increasingly important. However, there are many ways 
to assess the accuracy of a forecast depending on the application, depending on the 
definition of accuracy and depending on the need of the forecaster. Since one of the 
earlier comparisons of various forecasting algorithms by Willis and Northcote-Green 
[35], competitions and forums such as “M-competitions” and “GEFcom” have been 
used to bring researchers together to come up with new forecasting algorithms and 
assess their performance. As noted by Hyndman and Koehler [36] and Makridakis 
and Hibon [37], these competitions help set standards and recommendations for 
which error measures to use. 


2.2.1 Point Error Measures 


The mean absolute percentage error (MAPE) defined as 


fi = ài 
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, (2.10) 
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where f = (fi,..., fy) isthe forecast anda = (aj, ..., ay) is the actual (observed) 
load, is one of the most common error measures in load forecasting literature. 
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It is scale independent, so it can be used to compare different data-sets [36]. It 
is advantageous because it has been historically used and thus often forms a good 
benchmark. It is also simple and easily interpreted. However, it is not without flaws. 
As noted in Hyndman and Koehler [36] if 3 i, a; = 0, MAPE is undefined. Further- 
more, as a point error forecasts, it suffers from the double penalty effect which we 
shall explain later. In this book, we adopt a common adjustment that allows for data 
points to be zero and define the MAPE to be 


N 
È Ifi -ail 
MAPE = 1009,51 —— (2.11) 
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where f = (fi,..., fy) isthe forecast anda = (aj, ..., ay) is the actual (observed) 
load, 

The Mean Absolute Error (MAE) is also similarly popular due to its simplicity, 
although it is scale dependent. We define it as follows 


N 
1 

MAE =—)|f,-ail. 2.12 
ea ai| (2.12) 


However, similar to the MAPE, the MAE also susceptible to doubly penalising peaky 
forecasts. As pointed out in Haben et al. [1], the scale-dependence of MAE can be 
mitigated by normalising it. In this way the authors were able to compare feeders of 
different sizes. In our case normalising step is not necessary as we compare different 
algorithms and look for an average over the fixed number of households. 

Haben et al. [19] consider a p-norm error measure. We define it here by 


N : 
E, = ||f — allp = (yr) ; (2.13) 


i=1 


where p > 1. In this book, as in Haben et al. [19], we take p = 4 as it allows larger 
errors to be penalised more and smaller errors to be penalised less. Thus, we will use 
E; in order to focus on peaks. 

Lastly, we also consider the Median Absolute Deviation (MAD) which is defined 
the median of | f; — aj| fori = 1,..., N. The MAD is considered more robust with 
respect to outliers than other error measures. 
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2.2.2 Time Shifted Error Measures 


In this section we present some alternatives to the standard error measures listed in 
Sect. 2.2.1. Suppose the actual profile has just one peak at time k and the forecast 
also has just one peak at time k + D where i > 0 is small (say maybe the next or 
previous time unit). The point error measures in Sect. 2.2.1 penalise the forecast 
twice: once for not having the peak at time k and a second time for having a peak at 
time k + i. A flat forecast (fixed value for all time) under these circumstances would 
have a lower error even though in practice it is not a good or useful to the forecaster. 
To deal with such problems, it would be intuitively beneficial to be able to associate 
a shifted peak to a load, including some penalty for a shift, as long as it is within 
some reasonable time-window. In this section we discuss two ways of comparing 
time series: dynamic time warping and permuted (so called adjusted) errors. 


2.2.2.1 Dynamic Time Warping 


Dynamic Time Warping (DTW) is one of the most common ways of comparing two 
time series not necessarily of equal length. It has been used in automatic speech 
recognition algorithms. 

Dynamic Time Warping calculates an optimal match between two sequences given 
the some condition: suppose you have time series x and y with length N and M, 
respectively. Firstly, the index from X can match with more than one index of Y, and 
vice versa. The first indices must match with each other (although they can match 
with more than one) and last indices must match with each other. 

Secondly, monotonicity condition is imposed, i.e. if there are two indices, i < j 
for x and if i matches to some index / of y, then j cannot match to an index k of y 
where k < I. One can make this explicit: an (N, M)-warping path is a sequence p = 
(pl,..., pL) with pe = (ne, me) for £ € {1,..., L}, ne = (1,..., N}, and me = 
{1,..., M} satisfying: (1) boundary condition:p; = (1, 1) and pz = (N, M), (ii) 
monotonicity condition: nı < no €-:- «np andm, < m» <--- < mj and (iii) step 
size condition: pe+1 — pe € {(1, 0), (0, D, (1, D) for £ € {1,2,..., L — 1}. 

This is useful to compare electricity load forecast with observation as it allows us 
to associate a forecast at different time points with the load and still give it credit for 
having given useful information. However, there are some obvious downsides; the 
time step in the forecast can be associated with more than one observation which is 
not ideal. Moreover, if we were to draw matches as lines, lines cannot cross. This 
means that if we associate the forecast at 7.30 am with observation 8 am, we cannot 
associate forecast at 8 am with the observation at 7.30 am. 
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2.2.2.2 Adjusted Error Measure 


The adjusted error concept and algorithm to calculate it introduced by Haben et al. 
[19] addresses some of the issues. The idea of this permutation based error measure 
was used to create the forecasts discussed in Sect. 2.1.3. Given a forecast and an 
Observation over a time period, both sequences have the same length. Indices are 
fully (or perfectly) matched. Each index in one sequence is matched only to one 
index in the other sequence and any permutations within some tolerance window 
are allowed. The error measure assumes that an observation has been well enough 
forecasted (in time) if both the forecast and the observation are within some time 
window w. We define the Adjusted p-norm Error (ApE) by 


Ê” = min||P f — 2.14 
p = min ||P f —xllp, (2.14) 


where P again represents the set of restricted permutations. In Haben et al. [19], the 
minimisation is done using the Hungarian algorithm, but faster implementations are 
possible using Dijkstra's shortest path algorithm or topological sort as discussed by 
Charlton et al. [22]. 

While these may be intuitively suitable for the application at hand, they are com- 
putationally challenging and results are not easily conveyed to a non-specialist au- 
dience. The ApE also does not satisfy some properties of metrics In Voß et al. 
[38], a distance based on the adjusted p-norm error, the local permutation invari- 
ant (LPI) is formalised. Let P, denote the set of n x n permutation matrices. Let 
LY = {P = (pij € Pa: pj = 1 > |i — j| € w}. Then the function à : R” x R”, 
such that 

5(x, y) = min{|| Px — yl: P € £2} 


is an LPI distance induced by the Euclidean norm |. ||. 


2.3 Discussion 


Here, we have looked at several studies that review various load forecasting algo- 
rithms and how to assess and compare them. Clearly the literature on how to do this 
well for individual load profiles is an emerging field. Furthermore, only recently the 
studies regarding forecasting techniques for individual feeders/households compar- 
ing both machine learning algorithms and statistical models became available. This 
has been done in past for aggregated or high voltage systems [2, 9], but only recently 
for individual smart meter data. 

Linear regression is widely used for prediction, on its own or in combination 
with other methods. As the energy is used by all throughout the day, and people 
mostly follow their daily routines, autoregressive models, including ARMA, ARIMA 
and ARIMAX, SARMA and SARIMAX models are popular and perform well in 
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predicting peaks. Also triple exponential smoothing models, such as Holt-Winters- 
Taylor with intraday, intraweek and trend components are good contenders, while 
kernel-density estimators less so for individual households data. As expected, they 
work better on higher resolutions or aggregated level, where the data is smoother. 

Permutation-based methods are relatively recent development. They attempt to 
mitigate a *double penalty' issue that standard errors penalise twice slight inaccu- 
racies of predicting peaks earlier or later. Instead of taking a simple average across 
time-steps, with their adjust averaging they try to obtain a better *mean sample', and 
therefore to take into account that although people mostly follow their daily routine, 
for different reasons their routine may shift slightly in time. 

Finally, multi-layer perceptron and recurrent neural network appear to cope well 
with the volatility of individual profiles, but there is a balance of computational and 
time complexity and improvement, when comparing them with simpler, explainable 
and faster forecasts. 

There are yet many problems to be solved, such as the question of which features 
are important factors in individual load forecasting. While in general, there is an 
agreement that the time of the day, the day of the week and season are important 
factors for prediction, temperature, which is an important predictor of aggregate 
level seems to be not very relevant for prediction, (except for households with electric 
storage heating), due to the natural diversity of profiles being higher than temperature 
influence. 

We have also looked at the standard error measures in order to evaluate dif- 
ferent forecasting algorithms. While percentage errors such as are widely used as 
being scale-free and using absolute values they allow for comparison across different 
data-sets, we discuss the limitations: a small adjustment allows MAPE to cope with 
time-series with zero values, but it still suffers from a double penalty problem— 
trivial, straight line mean forecasts can perform better than more realistic, but im- 
perfect ‘peaky’ forecasts similarly to MAE. MAD error measure is introduced for 
error distributions that might be skewed, and 4-norm measure highlights peak errors. 
Alternatives that use time-shifting or permutations are also mentioned, as they can 
cope with a double penalty issue, but are currently computationally costly. 
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Chapter 3 A) 
Extreme Value Theory geai 


From travel disruptions to natural disasters, extreme events have long captured the 
public’s imagination and attention. Due to their rarity and often associated calamity, 
they make waves in the news (Fig. 3.1) and stir discussion in the public realm: is it a 
freak event? Events of this sort may be shrouded in mystery for the general public, 
but a particular branch of probability theory, notably Extreme Value Theory (EVT), 
offers insight to their inherent scarcity and stark magnitude. EVT is a wonderfully 
rich and versatile theory which has already been adopted by a wide variety of disci- 
plines in a plentiful way. From its humble beginnings in reliability engineering and 
hydrology, it has now expanded much further; it can be used to model the occur- 
rences of records (say for example in athletic events) or quantify the probability of 
floods with magnitude greater than what has been observed in the past, i.e it allows 
us extrapolate beyond the range of available data! 

In this book, we are interested in what EVT can tell us about electricity con- 
sumption of individual households. We already know a lot about what regions and 
countries do on average but not enough about what happens at the substation level 
or at least not with enough accuracy. We want to consider “worst” case scenario 
such as an area-wide blackout or the “very bad" case scenario such as a circuit fuse 
blowout or a low-voltage event. Distribution System Operators (DSO) may want to 
know how much electricity they will need to make available for the busiest time 
of day up to two weeks in advance. Local councils or policy makers may want to 
decide if a particular substation is equipped to meet the demands of the residents and 
if it needs an upgrade or maintenance. EVT can definitely help us to answer some 
of these questions and perhaps even more as we develop and adapt the theory and 
approaches further. 

There are many ways to infer properties about a population based on various 
sample statistics. Depending on the statistic, a theory about how well it estimates 
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the parameter of interest can be developed. The sample average is one such, very 
common, statistic. Together with the law of large numbers and subsequently the 
central limit theorem (as well as others), a well known framework exists. However, 
this framework is lacking in various ways. Some of these limitations are linked to the 
assumptions of the normal distribution of finite mean and variance. But what if the 
underlying distribution does not have finite variance or indeed even a finite mean? 
Not only this, the processes involved in generating a “typical” event may be different 
to the processes involved in generating an extreme event, e.g. the difference between 
a windy day and a storm event. Or perhaps extreme events may come from different 
distributions: for example, internet protocols are the set of rules that set standards on 
data being transmitted using the internet (or another network). Faloutsos et al. [1] 
concluded that power-laws can help analyse the average behaviour network protocols 
whereas simulations from [2] showed exponential distribution in the tails. 

EVT establishes the probabilistic and statistical theory of a different sample statis- 
tic: unsurprisingly, of extremes. Even though the study of EVT did not gain much 
traction before [3], some fundamental studies had already emerged in the earlier part 
of the twentieth century. While not the earliest analysis of extremes, the development 
of the general theory started with the work of [4]. The paper concerns itself with the 
distribution of the range of random samples drawn from the normal distribution. It 
was this work that officially introduced the concept of the distribution of the largest 
value. In the following years, [5, 6] evaluated the expected value and median of 
such distributions and the latter extended the question to non-normal distributions. 
The work of [7—9] gave the asymptotic distributions for the sample extremes. These 
works summatively give us the extreme value theorem and analogues of the central 
limit theory for partial or sample maxima. As this is one of the most fundamental 
results of EVT, we will explore it in more detail in Sect. 3.2. 
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In essence both the central limit theorem and the extreme value theorem are 
concerned with describing the same thing; an unusual event. The event may occur as 
a result from an accumulation of many events or from a single event which exceeds 
some critical threshold (or not), studied by [10]. Consider, a river whose water levels 
fluctuate seasonally. These fluctuations may erode its bank overtime or a single flash 
flooding may break the riverbank entirely. The first is a result of a cumulative effect 
with which the central limit theorem is concerned whereas the latter is the result of 
an event which exceeded what the riverbank could withstand, i.e. an extreme event, 
with which extreme value theory is concerned. 

Analogous to measuring “normal” behaviour using a mean, median or a mode, 
“extreme” behaviour may also be defined in multiple ways. Each definition will give 
rise to specific results which may be linked to, but different from, results derived 
from other definitions. However, these results complement each other and allow 
application to different scenarios/disciplines depending on the nature of the data and 
the question posed. The subject of EVT is concerned with extrapolation beyond the 
range of the sample measurements. Hence, it is an asymptotic theory by nature, i.e. 
results tell us what happens when sample size tends to infinity. 

The overarching goal of any asymptotic theory is two fold: 


1. to provide the necessary and sufficient conditions to ensure that a specific distri- 
bution function (d.f.) occurs in the limit, which are rather qualitative conditions, 
and 

2. to find all the possible distributions that may occur in the limit and derive a 
generalised form for those distributions. 


The first goal is known as the domains of attraction problem whereas the second 
is known as the limit problem. Before we take a look at the asymptotic theory, it is 
valuable to review some of the statistical background and terminology that will be 
prevalent throughout this report. 

The rest of this chapter is dedicated to an in depth exploration of extreme value 
theory, particularly the probabilistic foundations for the methods of inference pre- 
sented in Chap. 4 and ensuing application in Chap. 5. In the following section, we 
will introduce and clarify some of the central terminology and nomenclature. In 
Sect. 3.2, we will explore the fundamental extreme value theorem which gives rise to 
the generalised form to the limiting distribution of sample maxima and other relevant 
results. We will then consider results for the Peaks over threshold (POT) approach 
in Sect. 3.3. In Sect. 3.4, some theory on regularly varying functions is discussed. 
The theory of regular variation is quite central to EVT though often it operates from 
the background. Familiarity with regular variation is highly recommended for the 
readers interested in a mathematical and theoretical treatment of EVT. Finally, in 
Sect. 4.4, we consider the case where condition of identically distributed rv's can be 
relaxed. In Chaps.3 and 4 as a whole, we aim to provide intuitive understanding of 
the theoretical results, to expound reasons for these being in great demand to many 
applications and to illustrate them with some examples of challenges arising in the 
energy sector. 
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3.1 Basic Definitions 


As was mentioned earlier, various definitions of “extremes” give rise to different, 
but complementary, limiting distributions. In this section, we want to formalise what 
we mean by "extreme" as well as introduce the terminology that will be prevalent 
throughout the book. 

Suppose X1, X2, ..., Xn, ... is a sequence of independent random variables 
(r.v's). Throughout this book and until said otherwise we shall assume that all 
these rv's are generated by the same physical process and therefore it is reason- 
able to assume that any sample (X1, X2,..., Xn), made out of (usually the first) 
n random variables in the sequence, is a random sample of independent and iden- 
tically distributed (i.i.d) random variables with common distribution function (d.f.) 
F(x) := P(X < x),forallx e R. The non-stationary case, where the underlying d.f. 
function is allowed to vary with the time or location i of X; has been worked through 
in the extreme values context by [11]. We shall refrain to delve into detail on this case 
within this chapter but in Chap.4 we shall refer to the statistical methodology for 
inference on extreme that has sprang from the domain of attraction characterisation 
furnished by [11]. We shall assume the underlying df is continuous with probability 
density function (PDF) f. We also often talk about the support of X; this is the set 
of all values of x for which the pdf is strictly positive. The lower endpoint of the 
support of F or the left endpoint is denoted by xr i.e., 


xp := inf(x : F(x) > 0). 


Equivalently, we define the upper (or right) endpoint of F by 


xf := sup(x : F(x) < 1). (3.1) 


which can be either finite or infinite. When each of these values exist finite, these are 
probabilistically speaking the smallest and largest values that can ever be observed, 
respectively. In reality we are not likely to observe such extremes or if we do observe 
extremes we do not know how close they are to the endpoints. This is only to be 
aggravated by what is generally known in many applications, such as financial or 
actuarial mathematics, that there might no such thing as a finite upper bound. The 
main broad aim of EVT is centred around this idea. Its purpose is to enable estimation 
of tale-telling features of extreme events, right up to the level of that unprecedented 
extreme event so unlikely to occur that we do not expect it to crop up in the data. Until 
they do... Now that we have established that EVT sets about teetering on the bring of 
the sample, aiming to extrapolation beyond the range of the observed data, the sample 
maximum inherently features as the relevant summary statistic we will be interested 
in characterising. Preferably with a fully-fledged and complete characterisation, but 
flexible enough to be taken up by wider applied sciences. A probabilistic result 
pertaining to the sample maximum such that, for its simplicity and mild assumptions 
could serve as a gateway for practitioners to find bespoke tail-related models that 
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were not previously accessible or easily interpreted; a theorem like this would be 
dramatically useful. It turns out that such a result, with resonance often likened to the 
Central Limit Theorem, already exists. This theorem, the Extreme Value or Extremal 
Types theorem is the centrepiece to the next section. 

Before get underway to the asymptotic (or limit) theory of extremes, we need to 
familiarise ourselves with the following concepts of convergence for sequences: 


e A sequence of random variables X1, X», ... is said to converge in distribution to 


a random variable X [notation: X, a X] if, forall x € R, 
lim F,(x) = F(x). 
hn-—- œ 


This is also known as weak convergence. 


e The sequence converges in probability to X [notation: Xn 4x Jif, for any e > 0, 


lim P(|X, — X| > €) = 0. 
n oo 


e The sequence converges almost surely to X [notation: X, 2S xX]if, 


P ( lim X,=X)=1. 


n — oo 


Almost sure convergence is also referred to as strong convergence. 


3.0 Maximum of a Random Sample 


In the classical large sample (central limit) theory, we are interested in finding the 
limit distribution of linearly normalised partial sums, $, := X; + -+-+ Xn, where 
X1, X5, ..., Xn... arei i.d. random variables. Whilst the focus here is on the aggre- 
gation or accumulation of many observable events, non of these being dominant, 
the Extreme Value theory shifts to edge of the sample where, for its huge magnitude 
and potentially catastrophic impact, one single event dominate the aggregation of the 
data. In the long run, the maximum might not be any less than the sum. Although this 
seems a bold claim, its probabilistic meaning will become more glaringly apparent 
later on when we introduce the concept of heavy tails. 

The Central Limit theorem entails that the partial sum S, = X4 + Xo +... + X, 
of linearly normalised random variables, with constants a, > 0 and b,, drawn from 
an iid sequence is asymptotically normal, i.e. 


S, — b, d 
— —— —> N(0, 1). 


an n—oo 
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reflecting Charlie Winsor's principle that “All actual distributions are Gaussian in the 
middle". The suitable choice of constant for the Central Limit theorem (CLT) to hold 
in its classical form are b, = nE(X,) anda, = J/nVar(X;). Therefore, an important 
requirement is the existence of finite moment of second order, i.e. E|X |? < oc. 
This renders the CLT inapplicable to a number of important distributions such as 
the Cauchy distribution. We refer to [12, Chap. 1] for further aspects on the class of 
sub-exponential distributions, which includes but is far from limited to the Cauchy 
distribution. 

As it was originally developed, the Extreme Value theorem is concerned with par- 
tial maxima X, n := max(Xi, X5, ..., Xn) of an iid (or weakly dependent) sequence 
of rv's. Thus, the related extremal limit problem is to find out if there exist constants 
à, > O and b, such that the limit distribution to a, 1 (Xn,n — bn) is non-degenerate. It 
is worth highlighting that the sample maximum itself converges almost surely to the 
upper endpoint of the underlying distribution F to the sampled data, for the df of the 
maximum is given by P(X,,, < x) = F” (x) > Ipsyrj, as n — oo, with xf «oo 
and {Xn,n}n>1 recognisably a non-decreasing sequence. Here, the indicator function 
denoted by 14 is equal to 1 if A holds true and is 0 otherwise, meaning the probability 
distribution for the maximum distribution has mass confined to the upper endpoint. 
Hence, the matter now fundamentally lies in answering the following questions: (1) 
Is is possible to find a, > 0 and b, such that 


X, n b, . 
lim P (m x x) = lim F"(a,x + b,) = G(x), (3.2) 
n n— oo 


n oo a 


for all x continuity points of a non-degenerate cdf G? (ii) What kind of cdf G can 
be attained in the limit? (iii) How can be G be specified in terms of F? (iv) What are 
suitable choices for constants a, and b, question (i) is referring to? Each of these 
questions a addresses with great detail in the excellent book by [13]. 

The celebrated Extreme Value theorem gives us the only three possible distribu- 
tions that G can be. The extreme value theorem (with contributions from [3, 8, 14]) 
and its counterpart for exceedances above a threshold [15] ascertain that inference 
about rare events can be drawn on the larger (or lower) observations in the sample. The 
precise statement is provided in the next theorem. Corresponding result for minima 
is readily accessible by using the device X;,, = — max(— X1, —X2,..., — X,). 


Theorem 3.1 (Extremal Value Theorem). Let (X,),-1 bea sequence ofi.i.d. random 
variables with the same continuous df F. If there exist constants a, > 0 and b, € R, 
and some non-degenerate d.f. G such that Eq. (3.2) holds, then G must be only one 
of three types of d.f.'s: 
Fréchet 
®,(x) = exp(—x ?), x0 (3.3) 


Gumbel 
A(x) = exp(—e™*), x ER, (3.4) 
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Weibull 
Yal) = exp(—(—x)*), x x 0 (3.5) 


The Fréchet, Gumbel and Weibull d.f.'s can be in turn nested in a one-parameter 
family of distribution through the von Mises-Jenkinson parameterisation [16, 17]. 
Notably, the Generalised Extreme Value (GEV) distribution with df given by 


exp(-(14- 3x) I), y #0, 1+ 7x 20 


PROMUS exp(—e^), y=0. 


(3.6) 


The parameter y c, the so-called extreme value index (EVI), governs the shape of the 
GEV distribution. In the literature, the EVI is also referred to as the shape parameter 
of the GEV. We will explore this notion more fully after establishing what it means 
for F to be in the maximum domain of attraction of a GEV distribution. 


Definition 3.1. F said to be in the (maximum-) domain of attraction of G, [notation: 
F € D(G.)]ifitis possible to redefine constants a, > 0 and b, provided in Eq. (3.2) 
in such a way that, 

um F” (anx + bn) = G4(x), (3.7) 


with G., given by Eq. (3.6). 


We now describe briefly the most salient features in a distribution belonging to 
each of the maximum domains of attraction: 


1. Ify > 0, then the df F underlying the random sample (X4, X2,..., Xn) is said 
to be in the domain of attraction of a Fréchet distribution with d.f. given by ®1/,. 
This domain encloses all df's that are heavy-tailed. Qualitatively, this means that 
the probability of extreme events are ultimately non-negligible, and that upper 
endpoint x^ is infinite. Moreover, moments E (IX I) of order k > 1/y are not 
finite [18]. Formally, heavy-tailed distributions are those whose tail probability, 
] — F(x), is larger than that of an exponential distribution. Thus, noting that 
1 — G4(x) ~ «7 IP xU, as x — oo, we can then see that G.,.. is a heavy tailed 
distribution. Pareto, Generalised Pareto, and Inverse Gamma distributions are 
examples of distributions in the domain of attraction of the Fréchet distribution. 
Table2.1 in [19] gives a longer list of distributions in the Fréchet domain of 
attraction and the corresponding expansion of the tail distribution as well as the 
EVI in terms of the parameters of the distribution. 

2. At a first glance, the Gumbel domain of attraction would be the case of most 
simple inference as it is obvious that allowing to set y = 0, renders any esti- 
mation of EVI unecessary. However, the Gumbel domain attracts a plethora of 
distributions, sliding through a considerable range of tail-heavinesses, from the 
reversed Fréchet to the Log-Normal, and possessing either finite or infinite upper 
endpoint. Particular examples and a characterisation of the Gumbel domain can 
be found in [20]. 
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Fig. 3.2 Probability density function of several the extreme value distributions with varying a = 
1/|yl, o — 0 


3. Lastly, for y « 0, F is said to be in the domain of attraction of the Weibull 
distribution with d.f. Y_,,,. This domain contains short-tailed distributions with 
finite right endpoint. The case study presented in this report, that of electricity 
load of individual households, most likely belongs to the Weibull domain of 
attraction. This is because there is both a contractual obligation for customers to 
limit their electricity consumption and also physical limit to how much the fuse 
box can take before it short circuits. 


Figure 3.2 displays the prototypical distributions to each domain of attraction, for 
selected values of a. We highlight the long tails, polynomial decaying tails exhibited 
by the chosen Fréchet distributions, contrasting with the short, upper bounded, tails 
ascribed to the Weibull domain. 
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Now that we have tackled the extremal limit problem, we can start to dissect the 
domains of attraction. There are enough results on this to publish multiple chapters 
however not all results are pertinent to our case study. Thus, we take a smaller set 
of results for the sake of brevity and comprehension. We present the first theorem 
(Theorem3.2) which presents a set of equivalent conditions for F to belong to some 
domain of attraction condition. The proof for this (as for most results) are omitted 
but can be found in [18] (see Theorem 1.1.6). Theorem 3.2 allows us to make several 
important observations and connections. As noted before, we have two ways now 
to see that F € D(G,), one in terms of the tail distribution function 1 — F and the 
other in terms of the tail quantile function U defined as 


U(t) = (=o z r-(1 " J TEN (3.8) 


We note that the upper endpoint can thus be viewed as the ultimate quantile in the 
sense that x” = U (oo) := lim; U (t) < oo. Secondly, we have possible forms for 
b, and some indication as to what a, might be. 


Theorem 3.2. For y € R, the following statements are equivalent: 
1. There exists real constants a, > 0 and b, € R such that 


lim. F” (anx + bn) = G4(x) = exp (—(1 yx) ?), (3.9) 
n—> oo 


for all x with 1 + yx > 0. 
2. There is a positive function a such that for x > 0, 


: U(tx) — U(t) x?—] 
lim E 
1— o0 a(t) y 


, (3.10) 


where for y = 0, the right hand side is interpreted at continuity, i.e. taking the 
limit as y — 0 giving rise to log x. We use the notation U € ERV}. 
3. There is a positive function a such that 


Jim 1[1— Fax  U(0)] = (+ 9x)”, (3.11) 


for all x with 1 + yx > 0. 
4. There exists a positive function f such that 


~ 1-FE+af@) 
lim ———————— — 


-- —=1/7 
nu 1- FO =(1+ 7x) ; (3.12) 


for all x for which 1--»yx > 0. 


Moreover, a, and b, in Eq. (3.9) holds with a(n) and U (n), respectively. Also, f (t) = 
a(1/ — F(t))). 
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We see in the theorem above where theory of regular variation may come in 
with regard to U. Though we have deferred the discussion of regular variation to 
Sect. 3.4, it is useful to note that the tail quantile function U is of extended regu- 
lar variation (cf. Definition 3.6) and only if F belongs to the domain of attraction of 
some extreme value distribution. Extreme value conditions of this quantitative nature 
have resonance from a rich theory and toolbox that we can borrow and apply readily, 
making asymptotic analysis much more elegant. We can see what the normalising 
constants might be and we see that we can prove that particular d.f. belongs to the 
domain of attraction of a generalised extreme value distribution either using F or 
using U. We may look at another theorem which links the index of regular variations 
with the extreme value index. Proceeding along these lines, the next theorem bor- 
rows terminology and insight from the theory of regular variation; though defined 
later (Definition 3.4), a slowly varying function £ satisfies lim, — œ £(tx)/£(x) = 1. 
Theorem 3.3 gives us the tail distribution of F, denoted by F — 1 — F in terms 
of a regularly varying function and the EVI. Noticing that F is a regularly varying 
function means we can integrate it using Karamata's theorem (Theorem 3.13) which 
is useful for formulating functions f satisfying Eq. (3.12). 


Theorem 3.3. Let £ be some slowly varying function and F(x) :2 1 — F(x) denote 
the survival function, where F is the d.f. association with the random variable X. 
Let x* denote the upper endpoint of the df F. 


1. F is the Fréchet domain of attraction, i.e. F € D(G,) for y > 0, if and only if 


F(x) =x (x) — lim Ee i 


€ x hn, 
t>% 1] — F(t) 


for all x > 0. 
2. F is in the Weibull domain of attraction, i.e. F € D(G,) for y < 0, if and only 


if 


" 1 — F(xF — tx) 
F(xXF—xja2x ^g => lim —— ————— - x V^, 
(x X )—x (x) dc PFELD x 
for all x > 0. 
3. F is in the domain of attraction of the Gumbel distribution, i.e. y = 0 with 
xf «oo 


. l1-F(t T xf(t) x 
lim ————————— = 
(Tx 1— F(t) 


, 


for all x € R, with f a suitable positive auxiliary function. If the above equa- 
PF 
tion holds for some f, then it also holds with f (t) := [E (1 — F(s)Dds)/(1— 


F(t)) where the numerator of the integral exists finite for t < x". 


Theorem 3.4. F € D(G,) if and only if for some positive function f, 


. 1-Ft+xf@)) 
lim ——— —————- 


L -M 
um 1— FO =(1 +x) 7? (3.13) 
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for all x with 1+ yx > 0. If the above holds for some f, then it also holds with 


^t, y-0 
fi= —y(xF —t), 4 <0 
ma 1—F(x)dx 0 


1—F(t) : 


Moreover, any f that satisfies Eq. (3.13) also satisfies 


idiom 7 <0 


fO~AO, v7=9, 


where fi(t) is some function for which fj(t) — Oas t ^ xF. 


In this section, we have thus far mentioned a suitable function f which plays various 
roles however it should not be interpreted as probability density function of F, unless 
explicitly stated as such. Theorem 3.4 gives us alternative forms for f and its limit 
relations. 


Theorem 3.5. If F € D(G.), then 


1. for^y > 0: 
lim. F"(a,x) = exp(—x ^!) = 6.44 (x) 
n —> œO 


holds for x > 0 with a, := U (n); 
2. fory « 0: 


lim F" (a,x + x") = exp(-(—x) 7) = V (x) 
nao 


for x < 0 holds with a, := x" — U (n); 
3. fory =0: 
lim F" (a,x + bn) = exp (—e~*) = A(x) 
n-—- oo 


holds for all x with a, := f (U (n)) and b, := U (n) where f is as defined in 
Theorem 3.3 (3). 


We briefly consider maxima that have not been normalised and have been nor- 
malised for various sample sizes n = 1, 7, 30, 365, 3650 (Fig. 3.3). The left plot 
of Fig. 3.3 shows the distribution of the maxima where the lines represent, from left 
to right, each of the sample sizes. The right hand side of the same plot shows how 
quickly, the normalised maxima go to the Gumbel distribution; the exponential dis- 
tribution belongs to the Gumbel domain of attraction. The appropriate normalising 
constants for F standard exponential are a, = 1 and b, = log(n). Deriving this is 
left as an exercise. 
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Fig.3.3 Distributions of maxima (left) and normalised maxima (right) of n = 1, 7, 30, 365, 3650 
standard exponential random variables, with the Gumbel d.f. (solid heavy line) 
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Fig. 3.4 Distributions of maxima (left) and normalised maxima (right) of n = 1, 7, 30, 365, 3650 
standard normal random variables, with the Gumbel d.f. (solid heavy line) 


Doing the same thing except now with F standard normal distribution, Fig. 3.4 
shows that the convergence is slow. In this case, a, = (log n)- V ? and b, = 
(log n)! — 0.5 (log ny (log logn + log47). As before, F standard normal 
belongs to the Gumbel domain of attraction. 

Theorem 3.5 gives which sequences a, and b, should be used to normalise maxima 
in order to ensure that F is in the maximum domain of attraction of a specific G4. 
Note that the values of a, and b, changes with the sample size, n. If y were known 
before hand, knowing the true value of the normalising constants may help with 
simulations or numerical experiments. However in practice, we do not know ^ and it 
must be estimated. Thus, we can use the von Mises condition give us a work around. 
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Theorem 3.6 (von Mises Condition). Let r(x) be the hazard function defined by 


f(x) 


d (3.14) 


r(x) = 


where f (x) is the probability density function and F is the corresponding d.f.. 


1. Ifx* = oo and lim, pœ x r(x) = 1/7 > 0, then F € D(® 4,4). 

2. If x" < œ and lim, 4 xr (x* — x)r(x) = —1/y > 0, then F € D(W.1,,). 

3. Ifr(x) is ultimately positive in the negative neighbourhood of x" , is differentiable 
there and satisfies lim, 4 x r(x) = 0, then F € D(A). 


The von Mises conditions given in Theorem 3.6 is particularly useful when one is 
interested in conducting simulations. We may sample from a known distribution F 
which readily gives us the probability density function, f. Thus, without knowledge 
of the appropriate normalising constants, the von Mises conditions allow us to find 
the domain of attraction of F. 

We have discussed the asymptotic theory of the maximum from a sample. Earlier 
we mentioned that in practice, we divide the data into blocks of length n and take 
the maximum from each block and conduct inference on them. The results we have 
discussed in this section, tell us what happens as the block size becomes infinitely 
large. The approach of sampling maxima from blocks is, unsurpsingly known as the 
Block Maxima approach. As [21] pointed out, the block maxima model offers many 
practical advantages (over the Peaks Over Threshold, Sect. 3.3). The block maxima 
method is the appropriate statistical model when only the most extreme data are 
available; for example, historically temperature data was recorded in daily minimum, 
average and maximum. In cases where the time series may have periodicity, we can 
remove some dependence by dividing the blocks in such a way that dependence may 
exist within the block but not between blocks. We will now consider an alternative 
but equivalent method. 


3.3 Exceedances and Order Statistics 


When conducting inference on the tail of a distribution, itis wasteful to consider only 
the most extreme observation. We may be able to glean valuable information by util- 
ising more than just the maximum. For such cases, we may study either exceedances 
over a (high) threshold (Sect. 3.3.1) or we may consider order statistics (Sect. 3.3.2). 
In each case, we get different limiting distributions. In what follows we will discuss 
what the limiting distributions are in each case and how they relate to the extreme 
value distributions and the results from Sect. 3.2. 
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3.3.1 Exceedances 


In this instance, the idea is that statistical inference is be based on observations that 
exceed a high threshold, u, i.e., either on X; or on X; — u provided that X; > u for 
i < n. The exact conditions under which the POT model hold is justified by second 
order refinements (cf. Sect. 3.4) whereas typically it has been taken for granted that 
the block maxima method follows the extreme value distribution very well. We saw 
this from the discussion from Fig. 3.4. For large sample sizes, the performance of the 
block maxima method and the peaks over threshold method is comparable. However, 
when the sample is not large, there may be some estimations where the Peaks over 
threshold (POT) model is more efficient [21]. 

Since we have familiarised ourself with the convergence of partial maxima, 
we now do the same for exceedance. We will show that appropriately normalised 
exceedances converge to the Generalised Pareto distribution. This is the POT model. 
The work on exceedances was independently initiated by [15, 22]. As before, we 
will start with definitions and then proceed to establishing the Generalised Pareto as 
the limiting distribution. 


Definition 3.2. Let X be a random variable with d.f. F and right endpoint x^. 
Suppose we have the threshold u < x”. Then the d.f., F,, of the random variable X 
over the threshold u is defined to be 

F(x +u) — F(u) F 


BUQEPOCOWESM aaa IE a O0Oxx-«x —u. (3.15) 
— F(u 


Definition 3.3. The Generalised Pareto distribution is defined as 


1-1 Ur, 0, 
W,(x) = domu Tf (3.16) 
1—e*, y=0 
where 
x>0, y z0, 
O0O<x<-l/y, vy «0. 


Note that, as for G}, the Generalised Pareto distribution also has scale and location 
parameters: W,.,, g(x) := W,((x — v)/). Further, for x with 1 + yx > 0, 


1+ log G4(x) = W,(x). 
Now that we have looked at the d.f. of exceedance and defined the Gener- 


alised Pareto distribution, the latter can be established as the limiting distribution 
of exceedances. 
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Theorem 3.7 (Balkema-de Haan-Pickands Theorem). One can find a positive, mea- 
surable function p such that 


lim sup | |Fu(x) — GP), 0,4 (*)| = 0 (3.17) 


u Tx 9 c xc xF y 
if and only if F € D(G.). 


Not only does the Balkema-de Haan-Pickands theorem allow us to use more than 
the maximum, it also connects the d.f. of the random variables to that of exceedances 
over a threshold; from knowing the limiting distribution of F,, we also know about 
the domain of attraction of F and vice versa. The shape parameter in both cases 
is the same and thus their interpretation is the same as before, i.e., y describes 
the tail-heaviness of F if Eq. (3.17) is satisfied. Holmes and Moriarty [23] used the 
Generalised Pareto distribution to model particular storms of interest for applications 
in wind engineering and [24] used the POT method to analyse financial risk. 


3.3.2 Asymptotic Distribution of Certain Order Statistics 


Inthe previous section, we talked about how the POT approach can use data more effi- 
ciently. The efficiency relies on choosing the threshold appropriately. If the threshold 
is too low, then the exceedances are no longer from the tail and the bias is dominant. 
On the other hand, if the threshold is too high, then very few data points exceed it 
and the variance is high and confidence in the results is low. We can consider this 
idea of balancing the effects of bias and variance by considering a certain kind of 
order statistics. This is the topic of this section. 

Suppose X1, X5, ..., Xn, ... are iid. random variables with common d.f. F. If 
we take a finite sample X,,..., X, and order it from minimum to maximum, then 
we get the nth order statistics: 


Xin = Xan Ses Xpn- (3.18) 


Furthermore, we can define the kth upper order statistic, Xn—k,n, to be the kth largest 
value in the finite sample; the nth upper order statistic, i.e. k = n is the maximum 
and the first upper order statistic, i.e. k = 1, is the minimum. Depending on k and its 
relation to n, the kth upper order statistic can be classified in at least three different 
ways which leads to different asymptotic distributions. Arnold et al. [13] classified 
X, 4,5 to be one the following three order statistics: 


1. Central Order Statistics: Xn—k,n is considered to be a central order statistic if 
k = [np] + 1 where 0 < p < 1 and [-] characterises the function which is the 
smallest integer larger than the argument. 

2. Extreme Order Statistics: Xn—k,n is an extreme order statistic when either k or 
n —kis fixed and n — oo. 
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3. Intermediate Order Statistics: X, 4, is an intermediate sequence if both k and 
n — k approach infinity but k/n — 0 or 1. In this book we present results for 
k/n — 0 and we also assume that k varies with n i.e. k = k(n). 


Note that the conditions which ensure that X, ; ,, is an intermediate order statistic 
has similar notions of balancing bias and variance; insisting that k/n — 0 means that 
all data points larger than X„—ķ,n is a small part of the entire population and ensures 
analyses pertains to the tail of the distribution. However, for asymptotic results to 
hold, some of which we have seen in Sects. 3.2 and 3.3.1 and will see in this section, 
we require a large enough sample, i.e. k should go to infinity. As such identifying 
k appropriately is a crucial and a non-trivial part of extreme value analysis and 
also proves useful for the POT model as it allows us to chose u to be value which 
corresponds to the intermediate order statistics, X, ;,. 

Since we use intermediate order statistics in our case study on electricity load 
in Chap. 5, it is of more immediate interest to us but for the sake of completeness 
and intuitive understanding we discuss the asymptotic distribution of all three order 
statistics. First, we consider the convergence of the kth upper order statistics. 


Theorem 3.8. Let F be a d.f. with a right (left) endpoint x" < oo (xp > —oo) and 
k — k(n) be a non-decreasing integer sequence such that 


lu eB See [0, 1]. 


noo n 


1. Then X, 4), 2$ xF (xp) as c — 0 (c= 1). 
2. If we instead assume that c € (0, 1) is such that there is unique solution x(c) of 
the equation F(x) — c. Then 


a.s. 
Xn—k(n),n = x(c). 


Note that result 3.8 of Theorem3.8 relates to intermediate order statistics whereas 
result 3.8 relates to central order statistics. The proof is simple and can be found in 
[12]. We now proceed to the discussion of the asymptotic distribution for each of the 
order statistics. 


Theorem 3.9 (Asymptotic distribution of a central order statistic). For 0 < p < 1, 
let F be an absolutely continuous d.f. with density function f which is positive at 
F< (p) and is continuous at that point. For k = [np] + 1, as n > co, 


—k,an T F*(p) d 


Xn 
An f CF (p)) > N(O, 1), (3.19) 
pd — p) 


where A (u, 0?) denotes a normal distribution with mean j and variance c?. Thus 
note that the central order statistics, when appropriately normalised, converges to the 
normal distribution. This property is known as asymptotic normality and is particu- 
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larly desirable for estimators as it allows for the construction of confidence intervals 
with relative ease. The proof of Theorem 3.9 can be found in [13]. 

We can consider the asymptotic distribution of the extreme order statistics (also 
known as the upper order statistic) which no longer exhibits asymptotic normality. 
Instead, in this case, we recover links to the Generalized Extreme Value distribu- 
tion, G. 


Theorem 3.10 (Asymptotic distribution of an extreme order statistic). For any real 


x, P(X, 4 € anx + bn) > G(x) asn > oo if and only for any fixed k, 


k-1 
P(Xn-k+1,n < anx + bn) = > G(x) 
j-0 


(— log = Q9 (3.20) 


for all x. 


The proof can be found in [13]. Note that F being in the domain of attraction 
of an extreme value distribution implies that Eq. (3.20) holds with the same a,, and 
b,, and thus establishes a strong link between the asymptotic behaviour of extreme 
order statistics and the sample maxima. However, when k is allowed to vary with n 
as for intermediate order statistics, we again acquire asymptotic normality. 


Theorem 3.11 (Asymptotic distribution of an intermediate order statistic). Suppose 
the von Mises condition given in Theorem 3.6 holds for some G.,. Suppose further 
that k > oo and k/n — 0 as n — oo. Then 


Xn—kn = UC) 


Vk 
uq 


2 N(,1). (3.21) 


A proof for Theorem 3.11 can be found in [25]. Thus we see that although inter- 
mediate order statistics is somewhere between central order statistics and extreme 
order statistics and intuitively closer to the latter, its asymptotic behaviour is more 
akin to that of the central order statistics. Theorem 3.11 also gives us the appropri- 
ate normalisation. We now consider an example as it will demonstrate how to use 
Theorems 3.9 and 3.11. It is also useful for numerical simulation. 

Similarly, Theorems 3.9 and 3.10 can be used to choose appropriate normalisation 
for the relevant order statistics. Of course, in the above example, we have readily 
applied Theorem 3.11. In practice, we will need to check the von Mises condition or 
other relevant assumptions. This is taken for granted in the above example. 

Order statistics are particularly useful as they are used to build various estimators 
for y and x^. The commonly used Hill estimator for y > 0, is an example as is the 
more general Pickands estimator. 
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3.4 Extended Regular Variation 


We have already alluded to the topics in this section however due to the technical 
complexity, it is given only at the end. The theory of regular variation provides us a 
tool box for understanding various functions that we have come across. Moreover, to 
set the theory that we have discussed within a wider framework, stronger conditions 
are necessary. These conditions follow readily if we are familiar with the theory of 
regular variation. The topics in this section may seem disjointed and irrelevant but 
in fact, it is instrumental to making extreme value theory as rich and robust as it is. 
We will start with the fundamentals. 


Definition 3.4. Let £ be an eventually positive function on R+. Then £ is said to be 
slowly varying if and only if 
.  £(ux) 
lim =1 
uoo (x) 


Similarly, we can offer a more general version as follows. 


Definition 3.5. Let f be an eventually positive function on R,. Then f is said to be 
regularly varying with index p if and only if there exists a real constant p such that 


fy 2 = 


uo f(x) | 


(3.22) 


pis called the index of regular variation [notation: f € RV,]. Note that if f satisfies 
Eq. (3.22) with p = 0, then f is slowly varying. Strictly speaking the above defini- 
tions require f : Ry — R to be Lebesgue measurable. We can readily assume this 
as most functions in our case are continuous and thus satisfy Lebesgue measurabil- 
ity. Note also that all regularly varying functions f can be written in terms of the 
a slowly varying function £, i.e., if f € RV,, then f(x) = x?€(x) where £ € RVo. 
Note then that in Theorem 3.3, the tail of F was regularly varying in both the Fréchet 
and Weibull cases. 

We can make this even more general by considering functions that are of extended 
regular variation and/or belonging to a class of functions denoted by II. 


Definition 3.6. A measurable function f : R} — Ris said to be of extended regular 
variation if there exists a function a : R+ — IR, such that for some o € R\{0} and 


for x > 0, 
m PUD E 
im = . 


u — oo a(x) Q 


(3.23) 


[Notation: f € ERV,]. The function a is the auxiliary function for f. While we do 
not show this, a € RV,. We can see now observe that F € D(G,) => U e ERV, 
with auxiliary function a(t) (cf. Theorem3.2). Not only this but we can link f to be 
regularly varying as follows. 
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Theorem 3.12. Suppose f holds with Eq. (3.23), with a 4 0. Then 


1. Ifa > 0, then lim, o, f (x)/a(x) = l/a and hence f € RVa. 
2. Ifa < 0, then f (co) := lim, > œ f (x) exists, hm, > œ( f (00) — f(x))/a(x) = 
—]/a and hence f (co) — f(x) € RVa. 


The proof can be found in Appendix B of [18]. Since we now have the relation 
between the normalising constants and EVI with the index of regular variation, it can 
be used to construct estimators for the EVI. It can also be used in simulations where 
the true value is known or can be calculated. 


Definition 3.7. A measurable function f : R} — R is said to belong to the class II 
if there exist a function a : R4 — R+ such that, for x > 0, 


tm [09 IO 
m ——————— — 


zl : 
u— oo a(x) 08x 


where a is again the auxiliary function for f [Notation: f € I or f e II(a)]. In this 
case, a is measurable and slowly varying. Note that functions that belong to class II 
are a special case of functions which are of extended regular variation, i.e. where the 
index is 0. Next we consider Karamata's theorem which gives us a way to integrate 
regularly varying function. 


Theorem 3.13 (Karamata's Theorem). Suppose f € RV. There exists tọ > 0 such 
that f (t) is positive and locally bounded for t > to. If a > —1, then 


. tf (t) 
l NEA i idle t 1. 3.24 
DAT pas Ttt BS 


lfa < —l1, or a = —1 and fo f(s)ds < oo, then 


1 tft) | 
im o6 = 
t => œ i f (s)ds 


(3.25) 


Conversely, if Eq. (3.24) holds with —1 < a < oo, then f € RV. If Eq. (3.25) holds 
with —oo <a < —1, then f € RV,. 


Note that in Theorem 3.13, the converse for o, = —1 does not necessarily imply that 
f is regularly varying. 

It is obvious how the definitions and theorems we have looked at so far are 
relevant; we have provided examples of functions that were used in the report that 
satisfy one or more definition. Recall that in Sect. 3.3, we made mentions of second 
order refinements. The next part, though glance rather terse at first glance, provides 
a good source of valuable information to the prediction of distinctive features in 
extreme data. We shall look further at extended regular variation of U in Eq. (3.10) 
(i.e., Eq. (3.6) specialised in U) to give thorough insight as to how the normalised 
spacings of quantiles attain the GPD tail quantile function in the limit. The second 
order refinement below seeks to address the order of convergence in Eq. (3.10). 
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Definition 3.8. The function U is said to satisfy the second order refinement 
if for some positive function a and some positive or negative function A with 
lim; > œ A(t) = 0, 


U(tx)-U()) _ x?—1 


p a(t) EC 
im — c NEN =: A(x), x > 0, (3.26) 


where H is some function that is not a multiple of D} :— (x^ — 1)/+. 


The non-multiplicity condition is merely to avoid trivial results. The functions a 
and A may be called the first-order and second-order auxiliary functions, respec- 
tively. As before, the function A controls the speed of convergence in Eq. (3.10). The 
next theorem establishes the form of H and gives some properties of the auxiliary 
functions. 


Theorem 3.14. Suppose the second order refinement Eq. (3.26) holds. Then there 
exist constants c1, c» € R and some parameter p < 0 such that 


H, (x) = af sv} f u” du ds+e f sYt?7lgs. (3.27) 
1 1 1 


Moreover, for x > 0, 


a(tx) 
im IM o 
t —> 00 A(t) 
A(tx) 
im = 
1— oo 


A(t) 


— x? x^—] 


(3.28) 
p 


The results are understood in continuity i.e. taking the limit as p and/or y goes to 
Zero. This gives us that 


F (Dy+p(x) — Dy(x)) + coDy +(x), pz 0,y £0 
Ay (x) = a (x7 log x — D,(x)) + c2D,(x), p=1,y #0 (3.29) 
4 (log x)? + c2 log x, p=0,7y=0. 


The case of y = 0 and/or p = 0 is interpreted in the limiting sense as log x. Without 
loss of generality the constants featuring in the above can set fixed atc; = 1 and c? = 
0 (cf. Corollary 2.3.4 of [18]). The parameter p describes the speed of convergence 
in Eq. (3.26): p close to zero implies slow convergence whereas if |p| large, then 
convergence is fast. The above theorem results from the work of [26]. Finally, we 
can provide the sufficient second order condition of von Mises type. 


Theorem 3.15. Suppose the tail quantile function U is twice differentiable. Define 
A(t) :— Pi — y + 1. If A has constant sign for large t, lim; > œ A(t) = 0, and if 
|A| € RV, for p x 0, then for x > 0, 
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U(tx)—U(t) _ x-1 
tU'(t) y 


im AQ) = Hy). 

These definitions and results may seem unrelated or arbitrary but in fact some of 
the proofs of other results borrow understanding from the theory of regular variation, 
and functions such as the tail quantile function U as seen as of extended regular 
variation. Thus, regular variation theory allows us to extend the theory of extremes 
much further in a very natural way, it enables a full characterisation at high levels 
of the process generating the data by looking at the asymptotic behaviour of the 
exceedances above a sufficiently high threshold. It also allows us to prove asymptotic 
normality for various estimators. Thus, though quite involved, it is a very useful 
tool in extreme value analyses and is highly recommended for the enthusiastic or 
mathematically motivated reader. 

In conclusion, extreme value theory gives us a broad and well grounded founda- 
tion to extrapolate beyond the range of available data. Using either sample maxima 
or exceedances over a threshold, valuable inferences about extremes can be made. 
These are made rigorous by the first order and second order conditioning, which 
are underpinned by the broader still theory of regular variation. Moreover, we have 
techniques to conduct these analyses even when conditions of independence and 
stationarity do not hold. These results have already been adapted to fields such as 
finance, flood forecasting, climate change. They are accessible to yet more fields, 
and in this book they will be adapted for electricity demand in low-voltage networks. 
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Chapter 4 A) 
Extreme Value Statistics E 


When studying peaks in electricity demand, we may be interested in understanding 
the risk of a certain large level for demand being exceeded. For example, there is 
potential interest in finding the probability that the electricity demand of a business or 
household exceeds the contractual limit. An alternative, yet in principle equivalent 
way, involves assessment of maximal needs for electricity over a certain period 
of time, like a day, a week or a season within a year. This would stem from the 
potential interested in quantifying the largest electricity consumption for a substation, 
household or business. In either case, we are trying to infer about extreme traits in 
electricity loads for a certain region assumed fairly homogeneous with the ultimate 
aim of predicting the likelihood of an extreme event which might have never been 
observed before. While the exact truth may be not be possible to determine, it may 
be possible come up with an educated guess (an estimate) and ascertain confidence 
margins around it. 

In this chapter, not only we will list and describe mainstream statistical method- 
ology for drawing inference on extreme and rare events, but we will also endeavour 
to elucidate what sets the semiparametric approach apart from the classical paramet- 
ric approach and how these two eventually align with one another. For details on 
possible approaches and related statistical methodologies we refer to [1, 2]. 

We hope that, through this chapter, practitioners and users of extreme value theory 
will be able to see how the theoretical results in Chap. 3 translate in practice and 
how conditions can be checked. We will be mainly concerned with semi-parametric 
inference for univariate extremes. This statistical methodology builds strongly on 
the fundamental results expounded in the previous section, most notably the theory 
of extended regular variation (see e.g. [3], Appendix B). 

Despite the numerous approaches whereby extreme values can be statistically 
analysed, these are generally classed into two main frameworks: methods for max- 
ima over fixed intervals (blocks) and methods for exceedances (peaks) over high 
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thresholds. The former relates the oldest group of models, arising from the seminal 
work of [4] as block maxima models to be fitted to the largest observations collected 
from large samples (blocks) of identically distributed observations. The latter has 
often been considered the most useful framework in practical applications due to the 
widely proclaimed advantage over the former of a more efficient use of the often 
scarce extreme data. 

The two main settings in univariate extreme value theory we are going to address 
are: 


The block maxima (BM) method, whereby observations are selected in blocks of 
equal length (assumed large blocks) and perform inference under the assumption 
that the maximum in each block (usually a year) is well approximated by the 
Generalised Extreme Value (GEV) distribution. 

The peaks over threshold method (POT) enables to restrict attention to those obser- 
vations from the sample that exceed a certain level or threshold, supposedly high, 
and use mainstream statistical techniques such as maximum likelihood of method 
of moments to estimation and hypothesis testing, under the assumption that these 
excesses (values by each the threshold is exceeded) follow exactly a Generalised 
Pareto distribution (GPD). 


Application to energy smart meter data is an important part of the challenge, in 
the sense of the impeding application to extreme quantile estimation, i.e. to levels 
which are exceeded with only a small probability. A point to be wary of, when 
applying EVT is that, due to the wiggliness of the real world and the means by 
which empirical measurements are collected, observations hardly follow exactly an 
extreme value distribution. A recent application of extreme value theory can be 
found in [5]. Despite the underpinning theory to the physical models considered in 
this paper determines that the true probability function generating the data belongs 
to the Gumbel domain of attraction, the authors attempt statistical verification of 
this assumption with concomitant estimation of the GEV shape parameter y via 
maximum likelihood, in a purely parametric approach. They find an estimate of 
—0.002 which indicates that their data is bounded from above, i.e., there is a finite 
upper endpoint. However this is merely a point estimate whose significance must be 
evaluated through a test of hypothesis. In the parametric setting, the natural choice 
would be the likelihood ration test for the simple null hypothesis that ^; = 0. 

The semiparametric framework, where inference takes places in the domains of 
attraction rather than through of the prescribed limiting distribution to the data— 
either GEV or GPD depending on we set about to look at extremes in the data at our 
disposal—has proven a fruitful and flexible approach. 

In this chapter, we discuss the choice of max-domains of attraction within the 
semiparametric framework where an EVI y = 0 and finite upper endpoint are allowed 
to coexist. To this effect, we choose to focus on the POT approach as the methodology 
expounded here will be greatly driven and moderated by the statistical analysis of 
extreme features (peaks) exhibited by the Irish smart meter data. The description of 
the data has been given in Chap. 1. We recall that the data comprises 7 weeks of half- 
hourly loads (in kWh) for 503 households. Figure 4.1 is a rendering of the box-plots 


4 Extreme Value Statistics 63 


Weekly maxima 


10.0 


value 
— 
b se 
" 


5.0 


2.5 


0.0 


Sa qr ——————qp————————R——————R——— W——— 
Week16 Week17 Week18 Week19 Week20 Week21 Week22 
variable 


Fig. 4.1 Parallel box-plot representation of weekly maxima 


for each week being analysed. All seven box-plots look very similar, both in terms of 
median energy demand and dispersion of values of energy around the median. There 
is however one absolute extreme which we will be taking notice in the next sections. 
This value was recorded in Week 21. An important consideration at this point is that 
there is a physical limit to the individual electricity loads, which not any less imposed 
by limitations of the electrical supply infrastructure than by contractual constraint 
on the upper bound for an individual load. In statistical terms, this means that the 
assumption of a finite upper endpoint to the d.f. underlying seems fairly reasonable. 
Nonetheless, the data might indicated otherwise, suggesting that households are 
actually operating far below the stipulated upper bound, a circumstance that can be 
potentially exploited by the energy supplier so as to improve management of energy 
supply. 

Figure 4.2 is a scatter-plot representation of the actual observations in terms of the 
household number. With these two plots we intend to illustrate the two stated methods 
for the statistical analysis of extreme values, both BM and POT. The top panel shows 
all the data points. While proceeding with the BM method, one would take the largest 
observation at each h = 1,2,...,503, whereby one would have available a sample 
of 503 independent maxima. On the other hand, applying the POT method with 
the selected high threshold t = 7 kWh, we are left with fewer data points and more 
importantly with several observations originating from the same household. In this 
case, we find the observations fairly independent only because they are one week 
apart. We highlight that the POT method implies that some households are naturally 
discarded, which we find an important caveat to the POT-method, a method that has 
heralding the efficient use the available extreme data. 
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Fig. 4.2 Weekly maxima and exceedances above the threshold 7 kWh for the Iris smart meter data 
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4.1 Block Maxima and Peaks over Threshold Methods 


Due to their nature, semi-parametric models, are never specified in detail by hand. 
Instead, the only assumption made is that F is in the domain of attraction of an 
extreme value distribution, i.e. F € D(G,). In order to better understand what we 
mean by inference in extreme domains of attraction, let us remind ourselves of the 
well-known condition of extended regular variation [3, 6, 7], introduced in Chap. 3, 
as tantamount to the domain of attraction condition. Notably, F € D(G,) if and 
only if there exists a positive measurable function a such that the pertaining tail 
quantile function U € ERV,. The limit in (3.10) coincides with the U-function of 
the GPD, with distribution function 1 + log G.,, which justifies the usual inference on 
the excesses above a high threshold ascribed to the POT method. In fact, the extreme 
value condition (3.10) on the tail quantile function U is the usual assumption in semi- 
parametric inference for extreme outcomes. We shall develop this aspect further in 
Sect. 4.3. In the next Sect. 4.2 we will start off with yet another equivalent extreme 
value condition to the extended regular variation of U that is provided in [8] for 
dealing with block length and/or block number as opposed to looking at a certain 
of upper order statistics above a sufficiently high (random) threshold. Let V be 
the left generalised inverse of —1/ log F, i.e. V(-1/ log(1— t)) = F“ (1 — t). In 
other words, V(t) = F^ (e7!/*), for 0 < t < 1, which conveys standardisation to 
the standard Fréchet. It is straightforward to see that the df F underlying the random 
sample (X1, ..., X,) belongs to some max-domain of attraction if and only if there 
exist functions a and b, as defined in Theorem 3.2, such that 


V(x)—b(t) x'—1 
mo at) y 


: (4.1) 


for all x > 0. In contrast to the previous case of associating relation (3.1) with (3.10), 
there is now an asymptotically negligible factor creeping in when substituting b(t) 
with V (t). This is where we focus next, as this factor reflects the bias stemming from 
absorbing 5 (or V) into the location parameter of the GEV limit distribution (see 
3.2). The common understanding is that such bias is somewhat difficult to control, 
but we will have a closer look at this in terms of the second order refinements. The 
theoretical development for working out the order of convergence in Eq. (4.1) and 
(3.10) in tandem is given in Proposition 4.1. For the proof, we refer the reader to [9]. 


Proposition 4.1 Assume condition (3.10) (i.e. F € D(G,)) and that U is of extended 
regular variation of second order, that is, there exists a positive or negative function 
A with lim; , o5 A(t) = 0 and a non-positive parameter p, such that for x > 0, 
U(tx)9-U() _ x?-1 
É zu E x"—1 
1-700 A(t) p* wp ^ 


) =:H,(x). (42) 
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Define 
Sr, y¥#1,p<-l, 
. AQ) - n. yz pol, 
A(t) :— 4 A(t), p>-—lor(y=1, p> —2), 
A(t) + pt, y=, p=-2, 
pt, y= 1, p< -2. 


IfA is either a positive or negative function near infinity, then with 


a()(14 325517), yAl, p -l, 
a(t) := 4 a(t), p> -—l or (y= 1, p > —2), 
a()(1- 5) v-Lps-2 


the following second order condition holds 


V(x)-V(u) — x?-1 


. a(t) ¥ TE . 
lim. A) = A, p(x), (4.3) 


for all x > 0, where p = max(p, —1) ify € 1, and p = max(p, —2) ify = 1. 


We now provide four examples of application of Proposition 4.1 alongside further 
details as to how the prominent GPD can, at a first glance, escape the grasp of this 
proposition. 


Example 4.1 


Burr(1, 7, ^). This example develops along similar lines to the proof of Proposi- 
tion 4.1. The Burr distribution, with d.f. 1 — (1 +.x7)~4, x »0,A, T > 0, provides 
a very flexible model which mirrors well the GEV behaviour in the limit of linearly 
normalised maxima, also allowing a wide scope for tweaking the order of conver- 
gence through changes in the parameter À. The associated tail quantile function is 
U (t) = (1^ — 1/7, t > 1. Upon Taylor's expansion of U, the extreme tail condi- 
tion up to second order (Eq. 4.2) arises: 


1 
xi —] 


U(tx) — U(t) = x r ATIA (x 3G7D _ 1) eco]. 
AT 


as t — oo. Whence, the second order condition on the tail given in Eq. (4.2) holds 
for y = 1/(A7) and p = —1/A, y + p z 0, with 


a(t) = MC = C - ir) and A(t) = +(- - De: = (y+ pt’. 
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Proposition 4.1 is clearly applicable and therefore the Burr distribution satisfies 
the extreme value condition of second order (Eq.4.3) with y = 1/(y7) and p= 
max(—1/A, —1) if T #1. 


Example 4.2 


Cauchy. The relevant d.f. is F(x) — + arctan x + 5 x € The corresponding tail 
quantile function is U(t) = tan(1/2 — n/t) = t/r —1/3t7! + O(t), as t > 
oo, and admits the representation U(tx) — U(t) = L[x —-1- Eye —1)+ 
O(t^^)], x > 0. Hence, we have that y = 1, p = —2 in Eq. (4.2) with auxiliary 
function a(t) = t/n. Proposition (4.1) thus ascertains that (Eq. 4.3) also holds true 
for the Cauchy distribution where y = 1 and p = —2. 


Example 4.3 


GPD(). The relevant d.f. is W,(x) = 1 — (1 + *yx)- ^^, for all x such that 1 + y > 
0. The pertaining tail quantile function is U (t) = (t? — 1)/y which is also born 
out of the exact tail condition (3.10). Clearly, U does not satisfy the second order 
condition (Eq. 4.2) in a straightforward fashion, however we are going to show that 
the corresponding V (t) = U(1/(1 — e V1) satisfies (Eq. 4.3). To this end, we shall 
deal with the cases ^; = 1 and y Z 1 separately. 


Case y = 1: Applying Laurent series expansion upon (1 — e~!/‘)~!, we get 


2 2 
7-2) + O(t 7), 


1 
Vitx) — V(t) = (1 ae D+ 


as t — oo. Whence, the second order condition (Eq. 4.3) holds with y = 1 and 
p = —2, where A(t) = t7?/6 and X(t) = t(1 + A(t)/p). 
Case y Æ 1: Upon Taylor's expansion around zero, we obtain 


y-1y\x7-1 «w-1 3 
Vex) -VO=n| (1 ) H, or), 
(13) - VO [rs y) ae eJ« G 
as t — oo. Whence, the second order condition (Eq. 4.3) holds with p = —1, 
where tA (t) = (1 — y)/2 and a(t) = t^(1 + A(0)/p). 


Therefore, the GPD verifies Proposition 4.1 if one tunnels through the consideration 
that the GDP satisfies (Eq. 4.2) with p = —oo. 
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Example 4.4 


Pareto(o). This distribution is a particular case of the GPD d.f. in Example 4.1 with 
y = 1/a > 0 and U (t) = t!/, that is U does not satisfy the second order condition 
(Eq. 4.2) and thus Proposition 4.1 stands applicable provided similar interpretation 
to Example 4.1. 


Example 4.5 


Contaminated Pareto(o). We now consider the Pareto distribution with a light 
contamination in the tail by a slowly varying function L(t) = (1 + log t), that is, 
L(tx)/L(t) > 1, as t > oo, for all x > 0. This gives rises to the quantile function 
U(t) = t'/°(1 + logt), with a > 0. For the sake of simplicity, we shall use the 
identification y = 1/a. With some rearrangement, we can write the spacing U (tx) — 
U (t) in such a way that the first and second order parameters in condition (Eq. 4.2), 
both y and p < 0, crops up: U (rx) — U(t) = yt (logt + D[(1 + mem) + 
Teer Hs (x)], where H, o(x) := nee logx — E). Note that we have provided 
an exact equality, i.e. there is no error term. We thus find that tampering with the 
Pareto distribution, by contaminating its tail-related values with a slowly varying 
factor, is just enough bring the convergence (Eq.4.2) to a halt which is flagged- 
up by the lowest possible p = 0. This stalling of the Pareto distribution enables to 
fullfil the conditions in Proposition (4.1) thus ensuring that this contaminated Pareto 
distribution belongs to the max-domain of attraction of the GEV distribution with 
y= l/a > O0 and p = 0. 


4.0. Maximum Lq-Likelihood Estimation with the BM 
Method 


Let us define the random sample consisting of k i.i.d. block maxima as 


Mi= max Xj i=1,2,...,k, m=1,2,... (4.4) 


(i—l)m<j<im 


The above essentially states that we are dividing the whole sample of size n into 
k blocks of equal length (time) m. For the Extreme Value theorem to hold within 
each block, the block length must be sufficiently large, i.e. one needs to impose m 
tending to infinity to able to proceed with inference. We are then led to the reasonable 
assumption that the sample of k-maxima behaves approximately as though it stems 
from the GEV. 

Under a semi-parametric approach, maximum likelihood estimators for the vector- 
valued parameter 0 = (u, c, y) are obtained by pretending (which is approximately 
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true) that the random variables Mi, M», ..., My are independent and identically 
distributed as maxima of the GEV distribution with d.f. given by 


TEE 


for those x such that c + y(x — u) > 0. The density of the parametric fit to the BM 
framework is the GEV density, which we denote by gg, may be differ slightly from 
the true unknown p.d.f. f underlying the sampled data. We typically estimate these 
constants a (mn) and b(m) via maximum likelihood, despite these being absorbed into 
the scale c > 0 and location js € parameters of the parametric limiting distribution 
thus assumed fixed, eventually. As a result, BM-type estimators are not so accurate 
for small block sizes since these estimators must rely on blocks of reasonable length 
to fulfill the extreme value theorem. 

The criterion function £, that gives rise to a maximum Lq-likelihood (MLq) 
estimator, 


k 
0 :— arg max 3 Ulga), 


i-l 
for q > 0, makes use of the Tsallis deformed logarithm as follows: 


c T 


q 2-0, (4.5) 


This MLq estimation method picks up the standard maximum likelihood estima- 
tor (MLE) if one sets the distortion parameter q = 1. This line of reasoning can 
be stretched on to a continuous path, that is, as g tends to 1, the MLq estimator 
approaches the usual MLE. The common understanding is that values of q closer 
to one are preferable when we have numerous maxima drawn from large blocks 
since this will give enough scope for the EVT to be accessible and applicable. In 
practice, we often encounter limited sample sizes n = m x k in the sense that either 
a small number of extremes (k sample maxima) or blocks of insufficient length m 
to contain even one extreme are available. MLq estimators have been recognised as 
particularly useful in dealing with small sample sizes, which is often the situation in 
the context of the analysis of extreme values due to the inherent scarcity of extreme 
events with catastrophic impact. Previous research by [10, 11] shows that the main 
contribution towards the relative decrease in the mean squared error stems from 
the variance reduction, which is the operative statement in small sample estimation. 
This is in contrast with the bias reduction often sought after in connection with large 
sample inference. Large enough samples tend to yield stable and smooth trajectories 
in the estimates-paths, allowing scope for bias to set in, and eventually reflecting 
the regularity conditions in the maximum likelihood sense. Once these asymptotic 
conditions are attained, the dominant component of the bias starts to emerge, and 
by then it can be efficiently removed. This is often implemented at the expense of 
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Fig. 4.3 Estimation of the EVI with the BM method 


an increased variance, for the usual bias/variance trade off in statistics seems never 
to offer anything worthwhile on one side without also providing a detriment to the 
other. 

Furthermore, we will address how the maximum likelihood compares with the 
maximum product of spacings (MPS) estimator in this case study. The MPS estimator 
of 0 maximises the product of spacings 


k+1 k+1 


[20 =[] {Goin - Goi}, 
i=1 = 


i-l 
with Gọ(xo,x) = 0 and Ge(xy.14) = 1, or equivalently the log-spacings 


k+1 
LMP5(9. x) = >. log D;(0). (4.6) 


i=1 


The MPS method was introduced by [12], and independently by [13]. A generalisa- 
tion of this method is proposed and studied in great depth by [14]. The MPS method 
was further exploited by [15] in estimating and testing for the only possible three 
types of extreme value distributions (Fréchet, Gumbel and Weibull), all unified in 
the GEV distribution. 

Figure 4.3 displays the sample paths of the adopted extreme value index estima- 
tors, plotted against several values of the distortion parameter q € [0.5, 1]. As q 
increases to 1, the deformed version of both ML and MPS estimators approach their 
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classical counterpart which stem from stipulating the natural logarithm as criterion 
function. The estimates for the EVI seem to consolidate between —0.14 and —0.13. 
The negative estimates obtained for all values of g provide evidence that the true d.f. 
F generating the data belongs to the Weibull domain of attraction and therefore we 
can reasonable conclude that we are in the presence of a short tail with finite right 
endpoint. The next section concerns the estimation of this ultimate return-level. 


4.2.1 Upper Endpoint Estimation 


A class of estimators for the upper endpoint x^ stems from the extreme value condi- 
tion (Eq.4.1) via the approximation V (co) © V (m) — a,,/^y, as m — oo, by notic- 
ing that V (oo) = lim;_,.. V(t) = F^ (1) = x. The existing finite right endpoint 
x” can be viewed as the ultimate return level. When estimating extreme character- 
istics of this sort, we are required to replace all the unknowns in the above by their 


empirical analogues, yielding the estimator for the right endpoint: 


i = Vim) — m. (4.7) 


where quantities à, V and ^ stand for the MLq estimators for the scale and location 
functions a (m) and V (m), and for the EVI, respectively. 

Figure 4.4 displays the endpoint estimates for several values of q < 1 with respect 
to the tilted version of both ML and MPS estimators. The latter always finds larger 
estimates than the former, with a stark distance from the observed overall maximum. 
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Since the adopted estimators do not seem to herd towards one value, it is not easy to 
conciliate between them. Given the maximum likelihood estimator has been widely 
used and extensively studied in the literature, it is sensible to ascribe preference to 
this estimator. Furthermore, since we are dealing with small sample sizes (we are 
taking the maximum over 7 weeks), the distorted version, i.e., the MLq estimator 
must be taken into account. Therefore, we find reasonable to take as estimate for 
the upper bound the midpoint of the region where the MLq changes way and travels 
across towards the plain ML estimator with increasing g. Thus, we get an upper 
bound of 14.0kWh, approximately. 


4.3 Estimating and Testing with the POT Method 


In Sect. 4.1, we noticed that the function appearing in the limit of the extended regular 
variation of U matches the tail quantile function of the Generalised Pareto distribu- 
tion. This fact reflects indeed the exceptional role of the GPD in the extreme value 
theory for exceedances [16, 17] and prompts the need for classifying of the tails of 
all possible distributions in D(G,) into three classes in accordance to the sign of the 
extreme value index y. For positive y, the power-law behaviour of the underlying tail 
distribution function 1 — F has important implications one of which being the pres- 
ence of infinite moments. Because when y > 0 the first order condition (3.10) can be 
rephrased as lim; , oo U (tx)/U (t) = x^, for all x > O, that is, U is y-regularly vary- 
ing at infinity (notation: U € RV,), then Karamata's theorem ascertains that E (X D 
is infinite for p > 1/y, where X} = max(0, X;). Hence, heavy-tailed distributions 
in a max-domain of attraction not only have an infinite right endpoint, but also the 
order of finite moments is determined by order of the magnitude of the EVI y > 0. 
The Fréchet domain of attraction contains distributions with polynomially decay 
tails such as the Pareto, Cauchy, Student's and Fréchet itself. All d.f.'s belonging 
to D(G,) with y < 0—Weibull domain of attraction—are light tailed distributions 
with finite right endpoint. Such domain of attraction encloses Uniform and Beta 
distributions. The intermediate case y = 0 is of particular interest in many applied 
sciences where extremes are relevant. At a first glance, the Gumbel domain of attrac- 
tion seems quite appealing for the simplicity of inference in connection to Gy that 
dispenses the estimation of y. But a closer inspection allows to appreciate the great 
variety of distributions possessing such an exponential tail related behaviour, whether 
having finite upper endpoint or not. Normal, Gamma and Lognormal distributions 
can all be found in Gumbel domain. The negative Fréchet distribution also belongs to 
the Gumbel domain, albeit with finite endpoint (cf. [18]). Therefore, a statistical test 
for assessing significance of the EVI would be of great use and most consequential. 
Looking for the most propitious type of tail before estimating tail-related features of 
the distribution underlying the data can mark the difference between aiming at the 
estimation of extreme quantile or sprinting to the estimation of the upper endpoint 
estimation. In fact, adopting tailored statistical methodology to the suitable domain 
of attraction for the underlying d.f. F has become a regular practice. 
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4.3.1 Selection of the Max-Domain of Attraction 


A test for Gumbel domain versus Fréchet or Weibul max-domains has received in 
the literature the general designation of statistical choice of extreme domains of 
attraction. References in this respect are [19—29]. 

The present section primarily deals with the two-sided problem of testing Gumbel 
domain against Fréchet or Weibull domains, i.e., 


F € D(Go) vs F € D(G.)),uo. (4.8) 


Bearing on the sample maximum as the dominant and most informative statistic in 
any analysis of extreme value, we shall consider the ratio statistic 


Xn,n uni Xn—kn 


T* (k) := T,(k) — logk = logk, | (49) 


TM 


i (X ids mE Xi-ks) 


i=l 


for the testing problem in Eq. (4.8). According to the asymptotic results stated in 
[27], the null hypothesis F € D(Go) is rejected, against the bilateral alternative 
F € D(G,)yz0, atan asymptotic level a € (0, 1) if 77 (k) < gaj20r T; (k) > 81-a/2, 
where g; denotes the e-quantile of the Gumbel distribution, i.e., g- = — log(— log £). 
One sided tests are also within reach of this test statistic. We reject the null hypothesis 
in favor of either unilateral alternatives H, : F € D(G,)y<o or H; : F € )0(G4)4s0 
on either side T* (k) < gq or T7 (k) > gi—a, respectively. Neves et al. [27] show that 
this statistic provides a consistent test to discriminate between light tails and heavy 
tails, departing from the null hypothesis of an exponential tail. 

Builton Shapiro-Wilk goodness-of-fit statistic, the well-known Hasofer and Wang 
test statistic, embodies the reciprocal squared empirical coefficient of variation asso- 
ciated with the sample of the excesses above the random threshold X,. ;,. More 


concretely, 
" 2 
(c X z) 
i=l 


, (4.10) 
k k ? 

kE zZ- (e »» zi) 
i=l i=l 


W, (k) = k 


where Zi := Xn-i+1,n — Xn—k.n, i = 1,...,k. Giving heed to the problem of the 
statistical choice of domains of attraction postulated in Eq. (4.8), we define for j = 
1,2 


k 
j 1 r 
NS = k » Cee I X» ka) (4.11) 
i=l 


and use it to write Hasofer and Wang, W,,(k), and Greenwood R,,(k) statistics in the 
following way: 
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(2) 
mie c h (4.12) 
(Nn? ()) 
1 | R,(k) — 2 | 
wk) = -l1 ; (4.13) 
kl 14+ (Ril) —2 


Considering, as before, the k upper order statistics from a sample of size n such that 
k = ką, is an intermediate sequence, i.e., k — oo and k/n — 0 as n — oo, define 


Rv (k) := Vk/4 (Rn(k) — 2) (4.14) 
W*(k) := V/k/4 (KW, (k) — 1). (4.15) 


These normalized versions, R*(k) and W,*(k), are eventually the main features to 
take part in the testing procedure. The critical region for the two-sided test of nominal 
size o is given by |77 (k)| > Zi-a/2, with ze denoting the e-quantile of the standard 
normal distribution and where T has to be conveniently replaced by R or W. 

In addition, one-sided testing problems 


F € D(Go) vs F € D(G44-o (or F € D(G,)ys0); 


can also be tackled with both these test statistics. Here, the null hypothesis is rejected 
in favor of either unilateral alternatives H, : F € D(G,), <0 or H; : F € D(Gy)ys0 
on either side T* (k) < Za or T; (k) > Z1—a, respectively, with z- denoting again the 
e-quantile of the standard normal distribution and whereas T has to be conveniently 
replaced by R or W. 

We remark that the test based on the very simple Greenwood-type statistic R* is 
shown to good advantage when testing the presence of heavy-tailed distributions is 
in demand. While the R*-based test barely detects small negative values of y, the 
Hasofer and Wang’s is the most powerful test under study concerning alternatives 
in the Weibull domain of attraction. The three testing procedures will be illustrated 
with the Iris smart meter data in the next section. 


4.3.2 Testing for a Finite Upper Endpoint 


The aim of this section is to assess finiteness in the right endpoint of the actual d.f. 
F underlying the Irish smart meter data. The basic assumption is that F belongs to 
some max-domain of attraction. We then consider the usual asymptotic setting, where 
assume an intermediate number k of order statistics to drawn inference upon, that 
is, we take k = k, — oo and k,/n — 0, as n — co, and hence the corresponding 
random threshold X, ,, — xë a.s.. 

The statistical judgment on whether a finite upper bound exists finite will be 
informed by the testing procedure introduced in [30]. Notably, the testing problem 
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can be tackled using the log-moments 


k—1 
1 
My: z 9 (log Xnin = log Xara) > j-12. (4.16) 
i=0 


The test statistic T; being used is defined as 


k wR 

1 Xn—-in — Xn-kn — T M M 

T= y n—i,n n—k,n with T :— d: De ( = [ 1] ) 
k i=l Xan = X, kn 2 


Under Hp, the standardised version of the test, Ty :— Vk logk Ti, is asymptotically 
normal. Moreover, T;* tends to inflect to the left for bounded tails in the Weibull 
domain and to the right if the underlying distribution belongs to the Gumbel domain. 
The rejection region of the test is given by |T¥| > z1-a/2, for an approximate o sig- 
nificance level. Figure 4.5 displays the sample path of 7,*, labeled TestEP, alongside 
the observed values of the three test for selecting max domain of attraction presented 
in Sect. 4.3.1. The horizontal grey lines mark the critical barriers for the one-sided 
test at a a = 5% significance level. When these critical bounds are crossed, the null 
hypothesis of the Gumbel domain is rejected in favour of the Weibull domain. The 
statement for the testing problem on the upper endpointis slightly different as it entails 
a different breakdown of the Gumbel domain. The choice of the optimal number k of 
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intermediate order statistics is of paramount importance to any inference problem in 
extremes. Many methods have been proposed, but sadly there is not universal solu- 
tion that can hold for the multitude of estimators and testing procedures available. 
Here, we loosely follow the guideline of [23] in that the most adequate choice of 
the intermediate number k (which carries over to the subsequent semi-parametric 
inference) should set on the lowest k at which the critical barriers are overtaken. The 
ratio test (Eq. 4.9), which is known to be the most conservative of all three tests for 
choice of domains, does not reject the null hypothesis of the Gumbel domain since 
the green trajectory remains above the second horizontal line from below, for all 
intermediate values of k considered. We have remarked that the Hasofer and Wang 
(HW) test defined in Eq. (4.13) is the most powerful test for detecting distributions 
in the Weibull domain of attraction. The application of the HW test seems to do 
justice to this Irish smart meter data set and finds sufficient evidence to reject the null 
hypothesis of the Gumbel domain in favour of entertaining estimation procedures 
suited to the Weibull domain. Therefore we will proceed on to the estimation of the 
finite upper bound in the POT framework. This will be tackled in the next section. 


4.3.3 Upper Endpoint Estimation 


Along similar lines to Sect. 4.2, a valid estimator for the upper endpoint x^ = U (oo) 
arises by making t = n/k in the approximate equality corresponding to (3.10), and 
then replacing U(n/k), a(n/k) and y by suitable consistent estimators, i.e. 


oF e(t) = à(1) 
x ; 5 


(cf. Sect. 4.5 of [3]). Typically we consider the semiparametric class of endpoint 
estimators as follows: 


a(n/k 
CL MEN LN (4.17) 
4 


For the application of EVT to the class (Eq. 4.17) of upper endpoint estimators it 
is thus necessary to estimate the parameter y. We mention the estimators: Pickands’ 
estimator [17] for y € IR; the so-called ML estimator [31], valid for y > —1/2; the 
moment estimator [32] and the mixed moment estimator, both valid for any y € R. 
These moment estimators are purely semiparametric estimators for they are devised 
upon conditions of regular variation underpinning the max-domain of attraction char- 
acterisation. Since these are rather qualitative conditions, with functions U and a 
specific to the underlying d.f. F, then inference must be built on summary statistics 
that can capture the most distinctive traits in tail-related observations. The method 
of moments is ideally suited to this purpose. 

In order to develop a novel estimator for the extreme value index y € R, [33] con- 
sidered a combination of Theorems 2.6.1 and 2.6.2 of [7] and went on with replacing 
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F with its empirical counterpart F, and t by the order statistic X, 4, with k < n. 


This led to the statistic 
MP (k) — LP (k) 


Ga = 5 (4.18) 
(LPW) 
where we define, for j = 1, 2, 
k-1 7 
; 1 X —k J 
LO (k) := (1 n “) 4.19 
H=: 3 v (4.19) 


and with M (k) given in Eq. (4.16). The statistic in (Eq. 4.18) is easily transformed 
into the so-called mixed moment estimator (MM) for the extreme value index y € IR: 


Ên (k) m 1 


AMM (k) .— : 
h W 1+2 min(2,( — 1,0) 


(4.20) 


The most attractive features of this estimator are: 


e like Pickands and Moment estimators, it is valid for any y € R and, contrary to 
the maximum likelihood estimator, it has a simple explicit functional form; 

e it is very close to the maximum likelihood estimator for y > 0; 

e Ify < 0, the asymptotic variance is the same as of the Moment estimator; if y > 0, 
its asymptotic variance is equal to that of the maximum likelihood estimator; 

e A shift invariant version with similar properties is available with the same asymp- 
totic variance without sacrificing the dominant component of the bias, which is 
never increased as long as we keep a suitable number k; 

e there are accompanying shift and scale estimators that make e.g. high quantile and 
upper endpoint estimation straightforward. 


Figure 4.6 shows the sample paths of several estimators of the EVI as the upper 
number k of order statistics embedded in the estimators increases, concomitantly 
lowering the threshold until the value 5 khW is reached. The standard practice for 
drawing meaningful conclusions from this type of plots is by eyeballing the trajec- 
tories and seek for a plateau of stability at the confluence of the adopted estimators. 
In the top panel of Fig. 4.6, the MLq estimator of the extreme value index y, which 
has no explicit closed form and thus delivers estimates numerically, experiences con- 
vergence issues. This is often the case with maximum likelihood estimation for the 
GPD when the true value is negative but close to zero. 

In the semi-parametric setting, whilst working in the domain of attraction rather 
than dealing with the limiting distribution itself, the upper intermediate order statistic 
Xn—kn plays the role of the high deterministic threshold u 4 x^ < oo above which 
the parametric fit to the GPD is applicable. For the asymptotic properties of the POT 
maximum likelihood estimator of the EVI under a semi-parametric approach, see 
e.g. [34-36]. Although theoretically well determined, even when » f 0, the non- 
convergence to a ML-solution can be an issue when » is close to zero. There are also 
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Fig. 4.6 Estimation of the EVI with the POT method 


irregular cases which may compromise the practical applicability of ML. Theoretical 
and numerical accounts of these issues can be found in [37, 38] and references therein. 

In the second panel in Fig. 4.6, we swap the MLq estimator with the MPS (or MSP) 
estimator for the GPD. Although there are issues in the numerical convergence for 
small values of k, where the variance is stronger, this estimator shows enhanced 
behavior returning estimates of the EVI in agreement with the remainder estimators. 
Therefore, it seems reasonable to settle with the point estimate 7 = —0.01. Itis worth 
highlighting that the MLq shows its best performance within the corresponding region 
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Fig.4.7 Estimation of the upper endpoint with the POT method 


of values of, that is, for k between 125 and 175, a region that also holds feasible for 
the tests expounded in Sect. 4.3.1. 

There is however one estimator for the upper endpoint x^ that does not depend on 
the estimation of the EVI y, worked out in [18], this being designed for distributions 
with finite upper endpoint enclosed in the Gumbel domain of attraction. The so-called 
general right endpoint estimator is defined as 


m 1 

AB , 
X = Xnn + Xn—k.n log 2 = log(I + t i Toer (4.21) 
Figure 4.7 displays the estimate-yields for several endpoint estimators in the class 
(Eq. 4.17), with accompanying general endpoint estimator. Again, the corresponding 
maximum Lq-likelihood estimator is found with the distortion parameter q being set 
equal to 0.88, where the k-values for which the Lq-likelihood method experienced 
convergence issues in the estimation of the EVI are now omitted. The value g = 1 
determines the mainstream ML estimator for the endpoint which the class of estima- 
tors defined in Eq. (4.17) also encompasses. The relative finite sample performance 
of these endpoint estimators is here compared with the naive maximum estimator 
Xnn- We recall that the observed maximum is 12.01. The general endpoint estima- 
tor consistently returns values around 12.4 for almost all values of k. All the other 
estimators, expect the MLq estimator for the upper endpoint seem to exacerbate the 
upper bound for the electricity load. Therefore, we find reasonable to advise that the 
estimate for the upper endpoint of < = 13.0kWh should be taken as benchmark for 
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future work concerning technologies to enable end use energy demand. We also issue 
the cautionary remark that this is a mere point estimate, disposed of any confidence 
bounds. 


4.4 Non-identically Distributed Observations—Scedasis 
Function 


Thus far we have assume that our data consists of observations of i.i.d random 
variables X4,..., X, are i.i.d random variables. In reality, this may not be the case. 
Thus far in this chapter, we have been treating the Irish smart meter data as if these 
comprise independent and identically distributed observations, but intuitively at least 
one of these assumptions may not hold. Despite the ways we are sampling extremes 
from the original raw data, through either BM or POT approaches, inevitably though 
a few households may end up as main contributors to the extreme values being 
taken up to the statistical analysis. Figure 4.8 is a representation of the excess load 
per household above 7 kWh. The top panel shows the exceedances in terms of their 
cumulative yields over 7 weeks, whilst the bottom panel is the same scatter plot 
already presented in Fig. 4.2. It is noticeable that the household yielding the absolute 
maximum of 12.01 kWh does not show the largest total in the cumulative excess 
plot but two other households are sparking up totals by consistently exceeding the 
threshold over the course ofthe 7 weeks. Despite this does not shift the upper endpoint 
itself (as this is kept fixed by structural settlements with the energy provider), it may 
push the observations closer to the true upper bound signifying a trend is present in 
the extreme layers of the process generating the data. 

Social demographics and behavioural changes are not likely to occur within the 
time span considered to this illustrative example, but we can see that in other applica- 
tions to electricity consumption even a decade is enough time for human behaviour 
to be vastly different. Regulation of the gas and electricity industry, introduction 
of software applications that can monitor and incentivise certain consumption over 
others, time-of-use tariffs, new low carbon technologies and the variety of electronic 
devices in homes will change the way consumers interact with the grid and consume 
their electricity and most probably has already changed. 

Nonetheless we still want to be able to address the same questions as we did 
before and we want to ensure that there is a probabilistic framework that grounds our 
statistical analysis. Our main concern lies with observations that are not identically 
distributed but we will include a short review for data that exhibit serial dependence. 

There are of course many ways to look at dependence in data sets indexed in time 
or space. Ways of pre-processing data to alleviate dependence issues and possible 
non-stationary have been considered in [39]. Historically, however, simpler clustering 
techniques have been employed [40]. We already discussed apropos to the BM, to 
choose the length of block in such a way that the dependence no longer plays a 
role or is weak for the extreme value theorem to hold. Ensuring that our observation 
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Fig. 4.8 Estimation of the upper endpoint with the POT method 


come from separate events is a simple way of ascertaining independence and that the 
sampled data contains meaningful information on the widely different phenomena 
emanating from similar hazards (earthquakes, rainfall, storms). For example, if we 
were considering heatwaves, they may last up to a fortnight thus considering daily 
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maxima in temperature taken during a heatwave will be part of the same weather 
system and subsequently dependent and also less representative of the wild. Similarly, 
if we are considering rainfall, a low pressure system may last two or three days thus 
maxima taken every 2—3 non-overlapping days may be considered to independent 
or only weakly dependent. In the same vein, we may look at weekly maxima of the 
electric load profiles of individual household to weed out the daily and sub-weekly 
patterns. Thus, the block to be chosen should be application and data specific. 

However, sometimes there is also temporal dependence in addition to individual 
events i.e. profiles change with time. For temperature data, the change comes with 
the season as well due to the diurnal cycle. Similarly, for electricity data there are 
many sources of seasonality to consider: the impact from temperature i.e., an annual 
cycle and the daily cycle as well as human behaviour which may repeat weekly. For 
example, we may restrict to only taking weekly maxima from the summer, etc. This 
is where we turn focus to non-stationarity extremes, meaning that the underlying 
distribution changes over time or across space or both. This aspect will be exploited 
through the definition of a trend in the frequency of extremes in such a way to maintain 
integrity as we move across the potentially different distribution functions ascribed 
to each of the considered time-intervals (or spatial-regions), assumed homogeneous 
within themselves and heterogeneous between them. The basic structural assumption 
to the trend on the time-space evolving probability that a certain large outcome is 
exceeded originates from the concept of comparable tails [41]. There have been 
estimators accounting for the heteroscedastic (non-stationarity in the scale) setting, 
first introduced by [42] and further developed by [43] to address challenged arising 
in the modelling of heavy-tails. 

The setup is laid out as follows. Suppose that X n ,..., XU? are observations from 
independent random variables taken at n time points, which are assumed to follow 
different distribution functions F, ;, ..., Fn,n but sharing a common upper endpoint 
denoted by x^. Suppose the following limit relation involving an offset or baseline 
distribution function F and a continuous positive function c taking values in [0, 1], 


L= Fi) i 
xx 1— FO) =<) nc, 


subject to the unifying condition 


1 
/ c(s)d(s) = 1. 
0 


Doing so allows c to represent the frequency of extremes in the tail. Einmahl et al. 
[43] advocate a kernel density estimator as ideally suited to tackle the estimation 
of the scedasis function which can be viewed as a density in itself. Specifically, the 
estimator for c(s) is given by 


" 1 «c t=; 
o= zx 3 fasse caet - ) (4.23) 
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where G is a continuous, symmetric kernel such that f^ G(s)ds = 1, with G(s) = 0 
for |s| > 1. The bandwidth, h :— A, satisfies h — 0 and kh — oo as n — œ, and 
I, = 1 denotes indicator function which is equal to 1 if A holds true and equal to 0 
otherwise. Finally we note that X, n- is the global random threshold determined by 
the kth largest observation. We shall defer the estimation of the scedasis in practice 
to Chap. 5 using the Thames Valley Vision data. 
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Chapter 5 A) 
Case Study get 


5.1 Predicting Electricity Peaks on a Low Voltage Network 


In the previous chapter, we looked at load measurements for all households together 
and we ignored their chronological order. In contrast, in this chapter, we are interested 
in short term forecasting of household profiles individually. Therefore, information 
about the time at which measurements were taken becomes relevant. 

To illustrate different popular methods and look at their errors, we use a subset 
of the End point monitoring of household electricity demand data from the Thames 
Valley Vision project,! kindly made publicly available by Scottish and Southern 
Energy Networks,” containing profiles for 226 households in 30 min resolution. 

We use the time window from Sunday, 19 July 2015 00.00 to Monday, 21 Septem- 
ber 2015 00.00. The first eight weeks (or less depending on the model) are used for 
training the models, and we want to predict the ninth week (commencing on the 14th 
September 2015), each half hour usage in that week for each of the households. 

On the Fig. 5.1, mean value (top) or maximum value (bottom) for each half-hour is 
computed for the day of the week for each household, and a box-plot over households 
is presented. On the Fig. 5.2, a box-plot is produced for each household over all the 
values recorded during the eight weeks of observations for that household. 

The data for each household consists of 3072 half-hourly values. We want to 
predict the last 336 value in each time series. The exploratory data analysisconfirms 
that there is a daily seasonality detected. The examples of seasonal decomposition 
using an additive seasonal model with a lag of 48 half-hours, i.e. one day, and auto- 
correlation and partial auto-correlation functions with lag 48 for two households 
are given on Fig. 5.3. In most cases, no uniform trend is observed, daily seasonal 
component usually contains morning and evening peak as expected, and residuals 


"https://www.thamesvalleyvision.co.uk. 
*http://data.ukedc.rl.ac.uk/simplebrowse/edc/Electricity/NT V V/EPM. 
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Fig. 5.1 Usage on different days 


mostly look random, as expected. The autocorrelation and partial autocorrelation 
inform us on how many past observations are relevant for predicting a half-hour, and 
they look different for different households. 
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Fig. 5.2 Households-half-hourly usage box-plot 
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(c) ACF and partial ACF for household 33. (d) ACF and partial ACF for household 220. 


Fig. 5.3 Exploratory data analysis of household profiles 


5.1.1 Short Term Load Forecasts 


In this section, several different popular forecasting algorithms from both statistical 
and machine learning backgrounds will be tested. We will evaluate them using four 
error measures described in Sect. 2.2, MAPE, MAE, MAD and £4. 

Since we want to compare errors for different forecasting algorithms, in Chap. 2 
we have established two simple benchmarks. A last week (LW) forecast, where the 
data from one week before is used to predict the same half-hour of the test week is 
extremely simple (as no calculation is needed), but relatively competitive. A simple 
average over several past weeks is also popular, a so called similar day (SD) forecast. 

Deciding on how many weeks of history to use to average over is not always 
straightforward, especially when seasons change. Here we have done a quick opti- 
misation of the number of weeks used. Although the smallest error is obtained for 
one week, i.e. when SD is the same to LW forecast, we use four weeks of history, as 
this resulted in the smallest 4th norm error, and we are interested in peaks. Examples 
of LW and SD forecasts are given on Figs. 5.4 and 5.5. 

In addition to the two benchmarks, LW and SD, four different algorithms: 
SARIMA (seasonal ARIMA), Permutation merge (PM), LSTM (recurrent neural 
network) and MLP (forward neural network) are compared. The detailed descrip- 
tions of the algorithms are given in Chap. 2. 
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Fig. 5.4 The LW forecast (red cros) and observation (blue dot) for one household 
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Fig. 5.5 The SD forecast (red cros) and observation (blue dot) for one household 


5.1.1.1 SARIMA 


As previously discussed, this is a variation of a widely used ARIMA model, where 
the past values are used to predict future, but also moving average helps to pick up 
changes in the observations. Integration ensures stationarity of the data. In seasonal 
autoregressive integrated moving average model (SARIMA) , seasonal part is added. 
In our case, that is the detected daily seasonality. The time series is split into peak load 
and seasonal part. The general peak load is assumed to be without and seasonal part 
is with periodicity. The parameters we use are p = {2, 3,5, 6}, d = 1, q = {0, 1, 3} 
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Fig. 5.6 The SARIMA forecast (red cros) and observation (blue dot) for one household 


for the general part and P = (1, 2), D = 0, Q = {0, 1, 2} for the seasonal part of the 
model. The parameters were obtained doing localised search based on the Akaike 
Information Criterion (AIC) for each household.? An example showing some success 
with the prediction of peaks can be seen on Fig. 5.6. 


5.1.1.2 PM 


The algorithm with the size of the window 1, therefore allowing permutations with 
one half hour before and after, was run for a different number of weeks in history. 
When using only one week, this is equal to LW benchmark. As shown on Fig. 5.7, 
there was no single value that optimised all four errors. We have chosen 4 weeks of 
history based on the smallest E4 error. While relatively similar to SD forecast, PM 
manages to capture some peak timings better (compare Fig. 5.5 with Fig. 5.8). 


5.1.1.3 LSTM 


The Long Short Term Memory, a recurrent neural network method with two hidden 
layers with 20 nodes each was used, implemented with Python Keras library [1], 
training the model over 5 epochs and using a batch size of 48 (the number of sam- 
ples per gradient update). The optimiser used was ‘adam’, a method for first-order 
gradient-based optimization of stochastic objective functions, based on adaptive esti- 
mates of lower-order moments [2]. For the input, we have used the previous load, 
half-hour and day of the week values. This was coded by 4 values: 0 for working 


3We used Pmdarima Python library, and its auto_arima functionality. 
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Fig. 5.7 PM algorithms performance for different mean error values 
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Fig. 5.8 The PM4 forecast (red cros) and observation (blue dot) for one household 


day followed by working day; 1 for working day followed by non-working day; 2 
for non-working day followed by working day; 3 for non-working day followed by 


non-working day. 


We ran limited parameter search from 10 to 30 nodes in each layer, and noticed, 
similarto [3] that the equal number of nodes per layer seem to work best. The minimal 
errors for all four error measures were obtained for the configuration with 20 nodes 
in each hidden layer, which agrees with the optimal choice obtained by [4], based 
on MAPE error only. An example of LSTM forecast can be seen on Fig. 5.9. 
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Fig. 5.9 The LSTM forecast (red cros) and observation (blue dot) for one household 


5.1.1.4 MLP 


Multi-layer perceptron, a feed-forward neural network with five hidden layers and 
nine nodes in each was chosen, after running limited parameter search, firstly deciding 
on the number of nodes in two layers (from 5 to 20) and then adding layers with the 
optimal number of neurons, 9, until the errors started to grow. All four error measures 
were behaving the same way. 
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Fig. 5.10 The MLP forecast (red cros) and observation (blue dot) for one household 
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We used MLPRegressor from Python scikit-learn library [5], using ‘relu’, the 
rectified linear unit function, f(x) = max(0, x) as the activation function for the 
hidden layers. An optimiser in the family of quasi-Newton methods, 'Ibfgs', was 
used as the solver for weight optimisation. The learning rate for weight updates was 
set to ‘adaptive’, i.e. kept constant on 0.01, as long as training loss was continuing to 
decrease. Each time two consecutive epochs fail to decrease training loss by at least 
0.0001, or fail to increase validation score by at least 0.0001, the learning rate was 
divided by 5. L2 penalty was set to a = 0.01. An example is shown on Fig. 5.10, 
where timing of regular peaks is mostly captured, but amplitudes are underestimated. 


5.1.2 Forecast Uncertainty 


In Tables 5.1, 5.2, 5.3, respectively, means, medians and maxima over households of 
all four errors are given for the four algorithms and two benchmarks. The box-plot of 
four errors means across 226 households is given on Fig. 5.11. The results show that 
SARIMA forecast is having the smallest errors for E4 error measure and performing 
best with respect to peaks. Two benchmarks are very competitive, when looking 
across all the values, with LW doing very well in all other three error measures. PM 
and MLP are slightly worse and LSTM is lagging behind. 

While four error measures give values that are all positive, the differences between 
predicted and actual value can be negative, in the case of underestimation. This 
is of importance, especially for peaks. The consequences of underestimated peaks 


Table 5.1 Mean errors 


Error LW SD PM SARIMA |LSTM MLP 
MAPE (%) |41.8924 [44.3813 |45.1576 43.3889  |47.1202 [47.7836 
MAE 0.0836 0.0844 0.0861 0.0866 0.0957 0.0902 
(kWh) 

MAD 0.0329 0.04314 | 0.0440 0.0450 0.0512 0.0538 
(kWh) 

E4 (kWh) | 1.2299 1.0588 1.0774 1.0524 1.2307 1.0924 


Table 5.2 Median errors 


Error LW SD PM SARIMA |LSTM MLP 
MAPE (%) |42.7588 | 42.3669 [43.1196 44.0777 |45.2884  |43.8456 
MAE 0.0742 0.0737 0.0758 0.0757 0.0792 0.0754 
(kWh) 

MAD 0.0278 0.0363 0.0375 0.0387 0.0445 0.0454 
(kWh) 

E4 (kWh) | 1.1785 0.9792 0.9916 1.0127 1.1598 1.0183 
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Table 5.3 Maximum errors 


Error LW SD PM SARIMA |LSTM MLP 
MAPE (96) | 150.3073 | 463.8288 — |459.1680 | 145.5191 | 343.5002 [681.7654 
MAE 0.2688 0.4831 0.4823 0.3111 0.7156 0.5721 
(kWh) 

MAD 0.1365 0.1680 0.1664 0.1700 0.1742 0.1799 
(kWh) 

E4 (kWh) 3.6269 5.7239 5.8357 3.0036 8.2499 6.4193 
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Fig. 5.11 Average errors across households 


(higher prices, outages, or storage control problems) are usually much worse than 
overestimation of peaks (higher costs, or non-optimal running). Histograms of those 
distances for all used methods are given on Fig. 5.12 with normal probability distri- 
bution function contours, based on the distances’ mean and standard deviation. One 
can see that these distances are not normally distributed. Almost all forecasts are 
more one-sided, therefore underestimating. This is especially pronounced for SD, 
PM, LSTM and MLP forecasts. Also, one can notice similarity in distances profiles 
between LW, and SARIMA on one hand and between other four forecast on the other 
hand. 

We note that this predictive task is quite challenging. In the week commencing 
31 Aug 2015, there is a double challenge of a summer bank holiday (31 Aug) and 
beginning of school year, while summer weeks before that in the UK in general 
are characterised by less consumption. This leaves only one full week of behaviour 
relatively similar to the week that we want to predict which explains why LW's 
MAPE is on average better than more sophisticated methods. 
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Fig. 5.12 Histogram of errors—differences between predicted and observed values 


5.1.3 Heteroscedasticity in Forecasts 


In this section, we want to look into timing and frequency of largest errors for all 
forecast methods that were compared in the previous section. We want to see if 
we can spot any patterns. Are the different methods better in different time-steps? 
Can we identify time periods that are more difficult for forecasting? To this end, 
we apply the development of the scedasis introduced in Sect. 4.4 to capture how the 
largest absolute errors stemming from each forecasting method evolve over time. The 
interpretation is the following: the higher the scedasis at time f € [0, 1], the higher 
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the propensity for extreme errors to occur at that time f. A value around 1 means 
stationarity in the errors of forecasts. 

Figure 5.13 displays the estimated scedasis, as given in (4.22), when we select the 
largest 50 errors determined by each forecasting method. The SARIMA model yield 
the least oscillation around 1 which is indicative of satisfactory performance of this 
time series model in capturing the relevant traits in the data. Both PM4 and SD4 seem 
to have better predictive power on later days of the week as they exhibit a decreasing 
trend in the likelihood of large errors. All methods show large uncertainty in the 
forecasts delivered between Wednesday and Thursday, where all the sample paths 
for the scedasis tend to concentrate above 1. The early hours of Thursday maxima 
box-plots on Fig.5.1c when compared with Fig.5.1a show more spread in values. 
In this way, the estimated scedasis values give us a way to quantify which times are 
more difficult for prediction regarding different algorithms. 
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Fig. 5.13 Estimated scedasis, ĉ, as a function of time 
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